How to extract the year from a URL path using REGEXP_EXTRACT in Google Data Studio?

Question

I'm building out a Google Data Studio dashboard and I need to create a calculated field for the year a post was published. The year is in the URI path, but I'm not sure how to extract it using REGEXP_EXTRACT. I've tried a number of solutions proposed on here but none of them seem to work on Data Studio.

In short, I have a URI like this: /theme/2019/jan/blog-post-2019/

How do I use the REGEXP_EXTRACT function to get the first 2019 after theme/ and before /jan?

Don't have it because that's where I'm stuck. This is all I have: REGEXP_EXTRACT(Page, '(\d{4}[^/.*]*)') and it isn't working as it is capturing the second occurrence of the year as well and produces an error in Data Studio. — user1267332

horcrux horcrux · Accepted Answer · 2019-04-02T13:34:49

Try this:

REGEXP_EXTRACT(Page, 'theme\/([0-9]{4})\/[a-z]{3}\/')

where:

theme\/ means literally "theme/";
([0-9]{4}) is a capturing group containing 4 characters from 0 to 9 (i.e. four digits);
\/[a-z]{3}\/ means a slash, followed by 3 lowercase letters (supposing that you want the regex to match all the months), followed by another slash. If you want something more restrictive, try with \/(?:jan|feb|mar|...)\/ for the last part.

See demo.

How to extract the year from a URL path using REGEXP_EXTRACT in Google Data Studio?

2 Answers