1
votes

I'm trying to query a google bigquery table using the regex from this blog post. Here it is, slightly modified:

pd\.([^”,\.\(\,\`) \’:\[\]\/\\={}]*)

regex101 example of its usage

It does not, however, work in my google bigquery python client SQL query:

query_results = client.run_sync_query(
"""
SELECT
  REGEXP_EXTRACT(SPLIT(content, '\n'),
                 r'pd\.([^”,\.\(\,\`) \’:\[\]\/\\={}]*)')
FROM
  [fh-bigquery:github_extracts.contents_py]
LIMIT 10
""")

query_results.run()

data = query_results.fetch_data()
data

BadRequest: BadRequest: 400 Failed to parse regular expression "pd.([^”,.(\,`) \’:[]/\={}]*)": invalid escape sequence: \’

1

1 Answers

2
votes

The problem here is that BigQuery uses re2 library for its regex operations.

If you try the same regex but using the golang flavor you will see the exact same error (golang also uses re2).

So maybe if you just remove the escaping of the ' character you'll already have it working for you (as I tested here it seemed to work properly).

Another issue that you might find is that the result of the SPLIT operation is an ARRAY. That means that BigQuery won't process your query saying that the signature of REGEXP_EXTRACT does not allow ARRAY<STRING> as input. You could use REGEXP_REPLACE instead:

"""
SELECT
  REGEXP_EXTRACT(REGEXP_REPLACE(content, r'.*(\\n)', ''),
                 r'pd\.([^”,\.\(\,\`) ’:\[\]\/\\={}]*)')
FROM
  [fh-bigquery:github_extracts.contents_py]
LIMIT 10
"""

The character "\n" is replaced by "" in this operation and the result is a STRING.