Using Flatten to select where var1 (non-repeated) = “abc” from a bigquery table which contains multiple nested variables?

Question

I'm fairly new to BigQuery (3rd day of using it with no training), I'm just trying to get my head around nested fields etc.

I've looked at the following resources and used the personsdata example on the google bigquery docs link

https://cloud.google.com/bigquery/docs/data

https://chartio.com/resources/tutorials/how-to-flatten-data-using-google-bigquerys-legacy-vs-standard-sql/

I'd like to run the below query:

select *
from [dataset.tableid]
where fullname = 'John Doe'

If I run this, I get the following error:

Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place

From reading the above articles this isn't possible because you need to flatten the results, which from what I can understand just duplicates all the none repeated variables i.e.

Fullname | age | gender | Children.name | children.age

John Doe | 22 | Male | John | 5

John Doe | 22 | Male | Jane | 7

One of the above articles suggests that you can still use the where statements by using the flatten function in bigquery:

select fullname,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place

If I change this to:

select *
FROM (FLATTEN([dataset.tableId], children))
WHERE fullname = 'John Doe'

Then this works fine and gives me what I need however if I change to this:

select *
FROM (FLATTEN([dataset.tableId], citieslived))
WHERE fullname = 'John Doe'

Then I get the following error:

Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_yearsLived

Can someone explain why this will work flattening based on "Children" but not "CitiesLived" and how to know what variables to use within flatten with more complex datasets with multiple nested variables?

Thank you in advance

Is there a reason to try to do this using legacy SQL? I would suggest using standard SQL instead, since it makes array-based transformations easier to express. — Elliott Brossard
Hi Elliott, thanks for your quick response, sorry, I wasn't even aware than bigquery had two SQL dialects. Let me have a look into this to see about migrating and why we use the legacy language. If you have any pointers on using the standard SQL then please let me know. Thank you — Thomas Chamberlain

Mikhail Berlyant Mikhail Berlyant · Accepted Answer · 2017-02-03T15:36:28

Can someone explain why this will work flattening based on "Children" but not "CitiesLived"

Check schema of this table again

                Schema                
  ----------------------------------- 
   |- kind: STRING                    
   |- fullName: STRING (required)
   |- age: INTEGER
   |- gender: STRING
   +- phoneNumber: RECORD
   |  |- areaCode: INTEGER
   |  |- number: INTEGER
   +- children: RECORD (repeated)
   |  |- name: STRING
   |  |- gender: STRING
   |  |- age: INTEGER
   +- citiesLived: RECORD (repeated)
   |  |- place: STRING
   |  +- yearsLived: INTEGER (repeated)

As you can see - when you flatten children repeated record – the only repeated record that is left for output is citiesLived and even though it has inside it yet another repeated field – yearsLived – they are not independent – thus BigQuery Legacy SQL can output result

Now, when you flatten by citiesLived – what you get in result are two repeated fileds - children and yearsLived. Those two are independent - thus BigQuery Legacy SQL cannot output such result.

how to know what variables to use within flatten with more complex datasets with multiple nested variables?

To make it work - you should add yet another flattening with (for example) yearsLived filed. Something like below

FROM (FLATTEN(FLATTEN([dataset.tableId], citieslived), yearsLived))

Adding all those multiple FLATTENs can become cumbersome so using BigQuery Standard SQL is really the way to go!

See Migrating from Legacy SQL to BigQuery Standard SQL

Using Flatten to select where var1 (non-repeated) = “abc” from a bigquery table which contains multiple nested variables?

2 Answers