I have a table in BigQuery with an ID field and a repeated record field along with some other fields like a data collection time.
There are multiple rows in this table for each ID and I want to somehow select/merge a single row for each ID. Almost every solution like selecting last, selecting first and aggregating rows with duplicate ID in one row are acceptable in my use case but I wasn't able to get any of them to work so far.
To be more precise my table has an ID field which in BigQuery terms is:
{name: ID, type: STRING, mode: NULLABLE}
and repeated field: {name: data, type: RECORD, mode: REPEATED}
along with some other (plain) fields.
In my table there are multiple rows for each ID that each one has a repeated field data
for itself. In my query result, I want to have a table with exact same schema in which each ID appears only once and its corresponding data
field is data
field from one of occurrences of ID in the original table. (or ideally union from all its occurances)
Here is a list of solutions that don't work here:
First: Using
row_number() OVER (PARTITION BY ID ORDER BY collection_time) as rn ... where rn=1
Cause: BigQuery flattens results when using partition by
even if it Unflatten Results option is used.
Second: Selecting row with max/min collection time value:
Because: the value of the column is not unique for each id in my table due to some duplication in other parts of system.
Third: Using group by
ID with nest
/first
on other fields.
Cause: using nest
on the repeated record destroys the relation in the record field. For example SELECT ID, nest(data.a), nest(data.b)
from:
ID data.a data.b
--------------------------
1 1a1 null
1a2 1b2
--------------------------
1 2a1 2b1
null 2b2
results in
ID data.a data.b
----------------------------
1 1a1 1b2
1a2 2b1
2a1 2b2
ID
is defined. Thenrepeated record
means repeated, so it's more then 1. After your merge you want plain record only 1 entry, or you want all entries to be retained after the merge? – Pentium10