Select only one row for each “ID” in a nested repeated schema in BigQuery

Question

I have a table in BigQuery with an ID field and a repeated record field along with some other fields like a data collection time.

There are multiple rows in this table for each ID and I want to somehow select/merge a single row for each ID. Almost every solution like selecting last, selecting first and aggregating rows with duplicate ID in one row are acceptable in my use case but I wasn't able to get any of them to work so far.

To be more precise my table has an ID field which in BigQuery terms is: {name: ID, type: STRING, mode: NULLABLE} and repeated field: {name: data, type: RECORD, mode: REPEATED} along with some other (plain) fields. In my table there are multiple rows for each ID that each one has a repeated field data for itself. In my query result, I want to have a table with exact same schema in which each ID appears only once and its corresponding data field is data field from one of occurrences of ID in the original table. (or ideally union from all its occurances)

Here is a list of solutions that don't work here:

First: Using

row_number() OVER (PARTITION BY ID ORDER BY collection_time) as rn ... where  rn=1

Cause: BigQuery flattens results when using partition by even if it Unflatten Results option is used.

Second: Selecting row with max/min collection time value:

Because: the value of the column is not unique for each id in my table due to some duplication in other parts of system.

Third: Using group by ID with nest/first on other fields.

Cause: using nest on the repeated record destroys the relation in the record field. For example SELECT ID, nest(data.a), nest(data.b) from:

ID     data.a      data.b
--------------------------
1      1a1          null
       1a2          1b2
--------------------------
1      2a1          2b1
       null         2b2

results in

ID      data.a       data.b
----------------------------
1        1a1         1b2
         1a2         2b1
         2a1         2b2

So you have different rows having the same ID and each row has repeated fields, and you want to merge the repeated structures into one structure right? — Pentium10
@Pentium10 Yes but I am flexible on the 'merge' so selecting just one row for each ID is also acceptable. — S.Mohsen sh
Let's define the cardinalities as it's confusing, so rows means full row as it was inserted eg the ones where ID is defined. Then repeated record means repeated, so it's more then 1. After your merge you want plain record only 1 entry, or you want all entries to be retained after the merge? — Pentium10
@Pentium10 I added a paragraph to the question, hope I was able to clearly describe the problem. — S.Mohsen sh
in your desired result - you mismatched data.a and data.b in such a way that initial pairs of a/b are not anymore preserved. Is there expected logic? so no hard relationship between a and b? — Mikhail Berlyant

Elliott Brossard Elliott Brossard · Accepted Answer · 2016-08-25T11:07:50

You'll have an easier time solving this using standard SQL (uncheck "Use Legacy SQL" under "Show Options"). You would use GROUP BY with ARRAY_CONCAT_AGG, e.g.:

SELECT id, ARRAY_CONCAT_AGG(data) AS data
FROM MyTable
GROUP BY id;

Select only one row for each “ID” in a nested repeated schema in BigQuery

2 Answers