0
votes

I'm building out a Google Data Studio dashboard and I need to create a calculated field for the year a post was published. The year is in the URI path, but I'm not sure how to extract it using REGEXP_EXTRACT. I've tried a number of solutions proposed on here but none of them seem to work on Data Studio.

In short, I have a URI like this: /theme/2019/jan/blog-post-2019/

How do I use the REGEXP_EXTRACT function to get the first 2019 after theme/ and before /jan?

2
Please share the current code to see where you are stuck.Wiktor Stribiżew
Don't have it because that's where I'm stuck. This is all I have: REGEXP_EXTRACT(Page, '(\d{4}[^/.*]*)') and it isn't working as it is capturing the second occurrence of the year as well and produces an error in Data Studio.user1267332
Try REGEXP_EXTRACT(Page, '/([0-9]{4})/')Wiktor Stribiżew

2 Answers

1
votes

Try this:

REGEXP_EXTRACT(Page, 'theme\/([0-9]{4})\/[a-z]{3}\/') 

where:

  • theme\/ means literally "theme/";
  • ([0-9]{4}) is a capturing group containing 4 characters from 0 to 9 (i.e. four digits);
  • \/[a-z]{3}\/ means a slash, followed by 3 lowercase letters (supposing that you want the regex to match all the months), followed by another slash. If you want something more restrictive, try with \/(?:jan|feb|mar|...)\/ for the last part.

See demo.

1
votes

As you mentioned, I think you only want to extract the year between the string. The following will achieve that for you.

fit the query as per your needs

 SELECT * 
 FROM Sample_table
 WHERE REGEXP_EXTRACT(url, "(?<=\/theme\/)(?<year>\d{4})(?=\/[a-zA-Z]{3})")