Given a Google BigQuery dataset with columns col_1, ..., col_m, how can you use BigQuery SQL to deduplicate the dataset on a subset of columns, say [col1, col3, col7]? That is, whenever several rows share the same values in [col1, col3, col7], the first of those rows should be kept and all the remaining duplicates removed.
Example: removeDuplicates([col1, col3])
col1 col2 col3
---- ---- ----
r1: 20 25 30
r2: 20 70 30
r3: 40 70 30
returns
col1 col2 col3
---- ---- ----
r1: 20 25 30
r3: 40 70 30
Doing this with Python pandas is easy: for a DataFrame (i.e. matrix), you call drop_duplicates(subset=[field1, field2, ...]), which keeps the first row of each duplicate group by default. However, no equivalent function exists in BigQuery SQL.
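For reference, here is the pandas version applied to the example table above (the column values are taken from the example; the behavior I want to reproduce in BigQuery is exactly this):

```python
import pandas as pd

# The example table from above
df = pd.DataFrame({
    "col1": [20, 20, 40],
    "col2": [25, 70, 70],
    "col3": [30, 30, 30],
})

# Keep the first row of each (col1, col3) group, drop the rest.
# keep="first" is the default; shown explicitly for clarity.
deduped = df.drop_duplicates(subset=["col1", "col3"], keep="first")
print(deduped)
```

Rows r1 and r2 collide on (col1, col3) = (20, 30), so r2 is dropped and r1 and r3 remain.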
My best guess for how to do this in BigQuery is to use the rank() window function:
https://cloud.google.com/bigquery/query-reference#rank
I am looking for a concise solution if one exists.