3
votes

I have a table with duplicate records. I want to remove them. I've created a column called "hash_code" which is just a sha1 hash of all the columns. Duplicate rows will have the same hash code. Everything is fine except when I tried to create a new table with a query containing GROUP BY. My table has RECORD data type, but the new table created flattens it even when I had specified it to not flatten. Seems like GROUP BY and the "-noflatten_results" flag doesn't place nice.

Here's an example command line I ran:

bq query --allow_large_results --destination_table mydataset.my_events --noflatten_results --replace 
"select hash_code, min(event) as event, min(properties.adgroup_name) as properties.adgroup_name, 
min(properties.adid) as properties.adid, min(properties.app_id) as properties.app_id,  
min(properties.campaign_name) as properties.campaign_name from mydataset.my_orig_events group each 
by hash_code "

In the above example, properties is a RECORD data type with nested fields. The resulting table doesn't have properties as RECORD data type. Instead it translated properties.adgroup_name to properties_adgroup_name, etc.

Any way to force BigQuery to treat the result set as RECORD and not flatten in GROUP BY?

Thanks!

1

1 Answers

2
votes

There are a few known cases where query results can be flattened despite requesting unflattened results.

  1. Queries containing a GROUP BY clause
  2. Queries containing an ORDER BY clause
  3. Selecting a nested field with a flat alias (e.g. SELECT record.record.field AS flat_field). Note that this only flattens the specific field with the alias applied, and only flattens the field if it and its parent records are non-repeated.

The BigQuery query engine always flattens query results in these cases. As far as I know, there is no workaround for this behavior, other than removing these clauses or aliases from the query.