1
votes

When does BigQuery flatten an intermediate result set? I was under the impression that it was only when FLATTEN was invoked, but I've encountered an example where the result is flattened without a FLATTEN.

This is the case - this base query returns one record:

select count(*) from publicdata:samples.trigrams 
where ngram = 'der Griindung im'
+-----+
| f0_ |
+-----+
|   1 |
+-----+

When queried, you can see that the record has a repeated field that is repeated twice.

select * from publicdata:samples.trigrams 
where ngram = 'der Griindung im'
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
|      ngram       | first |  second   | third | fourth | fifth | cell_value | cell_volume_count | cell_volume_fraction | cell_page_count | cell_match_count | cell_sample_id | cell_sample_text | cell_sample_title | cell_sample_subtitle | cell_sample_authors | cell_sample_url |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
| der Griindung im | der   | Griindung | im    | NULL   | NULL  | 2007       |                54 | 0.008746355685131196 |              54 |               54 | NULL           | NULL             | NULL              | NULL                 | NULL                | NULL            |
| der Griindung im | der   | Griindung | im    | NULL   | NULL  | 2008       |                47 | 0.007612568837058633 |              47 |               47 | NULL           | NULL             | NULL              | NULL                 | NULL                | NULL            |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+

When I add a filter on cell.value, I get two records instead of one - but I never flattened so I'm not sure about the behavior here. My expectation is that this would return the same output as the previous COUNT above. It doesn't:

select count(*) from publicdata:samples.trigrams 
where ngram = 'der Griindung im' and cell.value in ('2007', '2008')
+-----+
| f0_ |
+-----+
|   2 |
+-----+

What this means is that while I expect select * from publicdata:samples.trigrams where ngram = 'der Griindung im' and select * from publicdata:samples.trigrams where ngram = 'der Griindung im' and cell.value in ('2007', '2008') to return the same output, they don't because one is implicitly flattened and the other is not. While this may not seem like a huge issue, this could matter significantly if it was part of a nested query that expected an intermediate result to be flattened or repeated.

Under what conditions does BigQuery flatten results without an explicit FLATTEN?

2

2 Answers

1
votes

Let me answer first, how to get correct count in this case:

So instead of

SELECT COUNT(*) 
FROM [publicdata:samples.trigrams] 
WHERE ngram = 'der Griindung im' 
AND cell.value IN ('2007', '2008')

with result of

+-----+
| f0_ |
+-----+
|   2 |
+-----+

you should do

SELECT COUNT(*) 
FROM [publicdata:samples.trigrams] 
WHERE ngram = 'der Griindung im' 
OMIT RECORD IF EVERY(cell.value NOT IN ('2007', '2008'))

with result of

+-----+
| f0_ |
+-----+
|   1 |
+-----+

as I think what you expected

Secondly - Under what conditions does BigQuery flatten results without an explicit FLATTEN?

I think (just my guess baseed on BQ behavior observation) every time you explicitelly reference record's field within clauses like SELECT or WHERE , it gets automatically flattened for you. Using FLATTEN operator helps "control" this process.

1
votes

Short story: use count(0) instead of count(*). (You get 1 instead of 2.)

count(*) behaves strangely with repeated fields. It looks like the results are flattened, but if that were really the case, this should also affect count(0). I've asked about this here, but I haven't so far received a full explanation.