Given an arbitrary list of phrases phrase1, phrase2*, ... phraseN (say these are in another table, Phrase_Table), how would one get the count of matches for each phrase in a field F of a BigQuery table?

Here, "*" means there must be some non-empty/non-blank string after the phrase.

Let's say you have a table with an ID field and two string fields, Field1 and Field2.

The output would look something like:

id, CountOfPhrase1InField1, CountOfPhrase2InField1, CountOfPhrase1InField2, CountOfPhrase2InField2

Or, I guess, instead of all those output fields, there could be a single JSON object field:

id, [{"fieldName": Field1, "counts": {phrase1: m, phrase2: mm, ...}}, {"fieldName": Field2, "counts": {phrase1: m2, phrase2: mm2, ...}}, ...]

Thanks!

1 Answer

The example below is for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'foo1 foo foo40' str UNION ALL
  SELECT 'test1 test test2 test'
), `project.dataset.keywords` AS (
  SELECT 'foo' key UNION ALL
  SELECT 'test'
)
-- For each string, count occurrences of each key that are
-- immediately followed by a non-whitespace character
SELECT
  str,
  ARRAY_AGG(STRUCT(
    key,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) AS matches
  )) all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str

with result

Row str                     all_matches.key all_matches.matches  
1   foo1 foo foo40          foo             2    
                            test            0    
2   test1 test test2 test   foo             0    
                            test            2    
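
Note that the query above requires a trailing non-whitespace character for every key. If only phrases marked with a trailing "*" should carry that requirement, as the question describes, one possible variation is sketched below. It assumes (this is not stated in the question) that the "*" is stored as the last character of the key in the keywords table, and that the phrases contain no other regular expression metacharacters, since the key is used directly as a pattern.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'foo1 foo foo40' AS str UNION ALL
  SELECT 'test1 test test2 test'
), `project.dataset.keywords` AS (
  SELECT 'foo' AS key UNION ALL  -- plain phrase: every occurrence counts
  SELECT 'test*'                 -- assumed convention: trailing "*" requires a following non-whitespace character
)
SELECT
  str,
  ARRAY_AGG(STRUCT(
    key,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(
      str,
      IF(ENDS_WITH(key, '*'),
         -- strip the "*" marker and require one non-whitespace character after the phrase
         CONCAT(SUBSTR(key, 1, LENGTH(key) - 1), r'\S'),
         -- no marker: match the phrase itself
         key)
    )) AS matches
  )) AS all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str
```

With these sample rows, foo (no "*") is counted 3 times in the first string, while test* is counted 2 times in the second string, because the standalone occurrences of test are followed by whitespace or the end of the string.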

If you prefer the output as JSON, you can wrap the aggregation in TO_JSON_STRING(), as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'foo1 foo foo40' str UNION ALL
  SELECT 'test1 test test2 test'
), `project.dataset.keywords` AS (
  SELECT 'foo' key UNION ALL
  SELECT 'test'
)
-- Same counting logic as before, with the aggregated array serialized to a JSON string
SELECT
  str,
  TO_JSON_STRING(ARRAY_AGG(STRUCT(
    key,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) AS matches
  ))) all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str

with output

Row str                     all_matches  
1   foo1 foo foo40          [{"key":"foo","matches":2},{"key":"test","matches":0}]   
2   test1 test test2 test   [{"key":"foo","matches":0},{"key":"test","matches":2}]     

There are endless ways of presenting output like the above. Hopefully you can adjust it to whatever exactly you need :o)
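
As one such adjustment, here is a rough sketch that covers the id, Field1 and Field2 columns from the question and produces a single JSON field per id. The sample rows, the CTE names unpivoted and per_field, and the column alias field_value are made up for illustration. It keeps the same counting pattern as the queries above, so the same caveat applies: keys are used directly as regular expressions.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  -- Hypothetical sample rows shaped like the table described in the question
  SELECT 1 AS id, 'foo1 foo foo40' AS Field1, 'test1 test' AS Field2 UNION ALL
  SELECT 2, 'test1 test test2 test', 'foo foo7'
), `project.dataset.keywords` AS (
  SELECT 'foo' AS key UNION ALL
  SELECT 'test'
), unpivoted AS (
  -- One row per (id, field), so the same counting logic applies to every field
  SELECT id, f.fieldName, f.field_value
  FROM `project.dataset.table`
  CROSS JOIN UNNEST([
    STRUCT('Field1' AS fieldName, Field1 AS field_value),
    STRUCT('Field2' AS fieldName, Field2 AS field_value)
  ]) AS f
), per_field AS (
  -- Counts per key for each (id, field) pair
  SELECT
    id,
    fieldName,
    ARRAY_AGG(STRUCT(
      key,
      ARRAY_LENGTH(REGEXP_EXTRACT_ALL(field_value, CONCAT(key, r'[^\s]'))) AS matches
    )) AS counts
  FROM unpivoted
  CROSS JOIN `project.dataset.keywords`
  GROUP BY id, fieldName
)
-- Collapse the per-field results into one JSON string per id
SELECT
  id,
  TO_JSON_STRING(ARRAY_AGG(STRUCT(fieldName, counts))) AS all_matches
FROM per_field
GROUP BY id
```

Adding another field should only require one more STRUCT entry in the UNNEST list. The order of elements inside the arrays is not guaranteed unless an ORDER BY is added inside ARRAY_AGG.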