0
votes

I am trying to replicate the Google Analytics UTM builder in Google Sheets. I have a formula I use elsewhere that works on just under 50% of the use cases but needs some additional regex adding to it.

=ARRAYFORMULA(IF(ISBLANK(A2:A),"",CONCAT(REGEXREPLACE(A2:A, "(/[^/]+)(?:/(?:\?|ref=)[^/]*)?$", "$1/"),"&utm_source=cpc&utm_medium=mysite")))

Here is a Google Sheet showing the examples and result in column D. The formula with regex in it is partially working on the supplied data. A copy can be made with File > Make a Copy.

Having unminified the JS file used on the Google Dev Tools page there appears to be a regex function that I think Google is using to perform the string builder, but the format is different to how it is used in Google Sheets.

f = function(e) {
            var t = e.replace(".", "\\.");
            return h("^(?:https?://)?(:?" + t + ")(?:$|[/?#])")
        },

If you have a look at rows 5 to 8 in the Google Sheet, it might just be a tweak needed to the existing formula to cope with adding the string to cells with file extensions and performing no action if a UTM string is already present.

EDIT The existing formula containing regex adds the string "&utm_source=cpc&utm_medium=mysite" to the end of the cells The list below shows the ORIGINAL url, with the REQUIRED result afterward

https://www.aol.com

https://www.aol.com/?utm_source=cpc&utm_medium=mysite

(Standard URL with no file extensions, so a / is added and the UTM string appending. STATUS: WORKING)


https://aol.com

https://aol.com/?utm_source=cpc&utm_medium=mysite

(Standard URL with no www or file extensions, so a / is added and the UTM string appending. STATUS: WORKING)


https://www.aol.com/

https://www.aol.com/?utm_source=cpc&utm_medium=mysite

(Standard URL with no file extensions, but a ending / is present, so this is not added and the UTM string appended. STATUS: WORKING)


https://www.aol.com/index.html

https://www.aol.com/index.html?utm_source=cpc&utm_medium=mysite

(File extensions exist, so by adding a / to the end along with the UTM string appended it breaks. STATUS: FAILING. It required no / on file extensions)


https://www.aol.com/index.php

https://www.aol.com/index.php?utm_source=cpc&utm_medium=mysite

(Another type of file extension exists(of which there will be hundreds), so by adding a / to the end along with the UTM string appended it breaks. STATUS: FAILING. It requires no / on file extensions)


http://www.asos.com/puma/puma-suede-classic-black-trainers/prd/3272954

http://www.asos.com/puma/puma-suede-classic-black-trainers/prd/3272954?utm_source=cpc&utm_medium=mysite

(This string uses a product ID, so by adding a / to the end along with the UTM string appended it breaks. STATUS: FAILING. It required no / on Product IDs)


http://www.asos.com/puma/puma-suede-classic-black-trainers/prd/3272954?utm_source=cpc&utm_medium=asos

http://www.asos.com/puma/puma-suede-classic-black-trainers/prd/3272954?utm_source=cpc&utm_medium=asos

(The UTM string already exists, so it gets added on again. STATUS: FAILING. If a UTM exists, we need to ignore it and leave the cell as it is)

1
Just FYI: It seems Google should use h("^(?:https?://)?(?:" + t + ")(?:$|[/?#])"), not h("^(?:https?://)?(:?" + t + ")(?:$|[/?#])"). The non-capturing group syntax is (?:....). Well, I do not get it why that code can help. Please put some strings with expexcted output in the question itself. - Wiktor Stribiżew
I am not 100% sure that is the calculation they use, but you can see it here on #1506 and nothing else seems to be close pastebin.com/XFp193PP - sigur7
Original url and required url added to question - sigur7
If you could formulate the rules it would be best. - Wiktor Stribiżew
Rules added, I think that is what you meant - sigur7

1 Answers

1
votes

Here we go:

^((?:^https?:\/\/[^\/\n]+))(\/?$|\/)((?:(?!\?utm_source=cpc).)*)$

substitue with: $1/$3?utm_source=cpc&utm_medium=mysite

The main ideas here are

  • to capture everything till the first directory separator (or an early end) as the domain part
  • throw the first / into the full match and rebuild it to get a normalized URL
  • capture the rest with a tempered greedy token that excludes utm_source=cpc

Demo


However, this does not work since Google Sheets does not support lookarounds. It's hard to craft this only with the basic re2 syntax. Anyways, I found a solution. It's ugly and should be improved but shows that it can be done:

=ARRAYFORMULA(IF(ISBLANK(A2:A),"",CONCAT(REGEXREPLACE(A2:A, "^((?:^https?:\/\/[^\/\n]+))(?:\/?$|\/)(.*?)(?:\?.*)?$", "$1/$2"),IF(REGEXREPLACE(A2:A, "^.*?(\?.*)?$", "$1")="", "&utm_source=cpc&utm_medium=mysite", REGEXREPLACE(A2:A, "^.*?(\?.*)?$", "$1")))))

The basic ideas here are:

  • rebuild a normalized URL (same first two steps above) until the ? query parameter shows up. add the fixed text if there is none, otherwise, leave as is.

Demo2

Apart from that, Google Sheets is not up to the job. Do yourself a favor and process this with another tool or a different approach.