1
votes

I'm trying to select records from one table based on whether an item in a repeated field is located in a column in another table. I've been able to do this when explicitly listing the items I am testing for in my code, but not when selecting from another table. Let me demonstrate using the trigrams dataset:

Let's say that I wanted to select all records that had an appearance during certain years. But I don't just want just the data from those years - I want all the data associated with those records. If I just wanted the data for a few years I could do something like this (and this works):

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
    SOME(cell.value in ('1800', '1801')) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid

However, instead of coding '1800' and '1801' into my query, I have a table years that contains a set of years I am interested in. I would like this to work:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
    SOME(cell.value in (SELECT year_as_str FROM [mydataset.years])) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid

This doesn't work because bigquery requires that semi-joins be part of a WHERE or HAVING clause.

So I try to rearrange (back to the first query):

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802')) WITHIN RECORD

This results in an error Encountered " "WITHIN" "WITHIN "" ... Was expecting <EOF>

So now without the WITHIN RECORD:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802'))

This results in an error SELECT clause has mix of aggregations '...' and fields '...' without GROUP BY clause

But I'm not aggregating! So now I move the filter into a WHERE:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
WHERE SOME(cell.value in ('1801', '1802'))

And this tells me that Invalid function name: SOME. What?!

Is there a way to get the behavior I am looking for with BigQuery?

2

2 Answers

1
votes

Below solves your example and I hope you will be able to extend it to your real use case (if you will like solution)

SELECT 
    ngram, cell.value, cell.volume_count, 
    cell.volume_fraction, cell.page_count, cell.match_count
FROM [publicdata:samples.trigrams] AS trigrams
JOIN (
  SELECT ngram AS qualified
  FROM (
    FLATTEN((SELECT ngram, cell.value AS value
      FROM (FLATTEN([publicdata:samples.trigrams], cell.value))), value)
  ) AS t
  JOIN [mydataset.years] AS y
  ON y.year_as_str = t.value
  GROUP BY 1
) AS valid
ON valid.qualified = trigrams.ngram

Please pay attention to the fact that in [publicdata:samples.trigrams] the field cell.value is REPEATED STRING - that is why you see "extra" FLATTEN things

0
votes

You can use the OMIT RECORD IF clause for this. This may require double negation, since you need to omit records that meet some condition. The following query should work:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
OMIT RECORD IF EVERY(cell.value NOT IN ('1801', '1802'))