3
votes

I'm fairly new to BigQuery (3rd day of using it with no training), I'm just trying to get my head around nested fields etc.

I've looked at the following resources and used the personsdata example on the google bigquery docs link

https://cloud.google.com/bigquery/docs/data

https://chartio.com/resources/tutorials/how-to-flatten-data-using-google-bigquerys-legacy-vs-standard-sql/

I'd like to run the below query:

select *
from [dataset.tableid]
where fullname = 'John Doe'

If I run this, I get the following error:

Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place

From reading the above articles this isn't possible because you need to flatten the results, which from what I can understand just duplicates all the none repeated variables i.e.

Fullname | age | gender | Children.name | children.age

John Doe | 22 | Male | John | 5

John Doe | 22 | Male | Jane | 7

One of the above articles suggests that you can still use the where statements by using the flatten function in bigquery:

select fullname,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place

If I change this to:

select *
FROM (FLATTEN([dataset.tableId], children))
WHERE fullname = 'John Doe'

Then this works fine and gives me what I need however if I change to this:

select *
FROM (FLATTEN([dataset.tableId], citieslived))
WHERE fullname = 'John Doe'

Then I get the following error:

Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_yearsLived

Can someone explain why this will work flattening based on "Children" but not "CitiesLived" and how to know what variables to use within flatten with more complex datasets with multiple nested variables?

Thank you in advance

2
Is there a reason to try to do this using legacy SQL? I would suggest using standard SQL instead, since it makes array-based transformations easier to express. - Elliott Brossard
Hi Elliott, thanks for your quick response, sorry, I wasn't even aware than bigquery had two SQL dialects. Let me have a look into this to see about migrating and why we use the legacy language. If you have any pointers on using the standard SQL then please let me know. Thank you - Thomas Chamberlain

2 Answers

1
votes

Can someone explain why this will work flattening based on "Children" but not "CitiesLived"

Check schema of this table again

                Schema                
  ----------------------------------- 
   |- kind: STRING                    
   |- fullName: STRING (required)
   |- age: INTEGER
   |- gender: STRING
   +- phoneNumber: RECORD
   |  |- areaCode: INTEGER
   |  |- number: INTEGER
   +- children: RECORD (repeated)
   |  |- name: STRING
   |  |- gender: STRING
   |  |- age: INTEGER
   +- citiesLived: RECORD (repeated)
   |  |- place: STRING
   |  +- yearsLived: INTEGER (repeated)

As you can see - when you flatten children repeated record – the only repeated record that is left for output is citiesLived and even though it has inside it yet another repeated field – yearsLived – they are not independent – thus BigQuery Legacy SQL can output result

Now, when you flatten by citiesLived – what you get in result are two repeated fileds - children and yearsLived. Those two are independent - thus BigQuery Legacy SQL cannot output such result.

how to know what variables to use within flatten with more complex datasets with multiple nested variables?

To make it work - you should add yet another flattening with (for example) yearsLived filed. Something like below

FROM (FLATTEN(FLATTEN([dataset.tableId], citieslived), yearsLived))  

Adding all those multiple FLATTENs can become cumbersome so using BigQuery Standard SQL is really the way to go!

See Migrating from Legacy SQL to BigQuery Standard SQL

0
votes

If you run this query:

SELECT
*
FROM
(FLATTEN((FLATTEN(([project_id:dataset_id.table]), citiesLived.yearsLived)), citiesLived))

It will flatten as expected.

When using the Legacy SQL, BQ tries to flatten automatically the results for you.

What I have noticed though is that if you try to flatten repeated fields that have other repeated fields inside then sometimes you might run into these errors (notice that the fields citiesLived and citiesLived.yearsLived are both repeated).

So one way to solve that is by forcing the flatten operation on all repeated fields you want to work with (in the example I showed you I first flattened the yearsLived and then citiesLived) and not relying on the automatic flattening operation that the Legacy SQL offers.

But what I strongly recommend and encourage you to do is to learn the Standard SQL version for BQ as Elliot suggested in his comment. It might have a steeper learning curve at first but it will totally pay off in the long run (and you won't have the risk of eventually having to migrate all your legacy queries to standard as we had to do in our company)