3 votes

Given a Google BigQuery table with columns col_1 ... col_m, how can I use BigQuery SQL to return the table with no duplicates in, say, [col1, col3, col7]? That is, when several rows have identical values in [col1, col3, col7], the first of those rows should be returned and the remaining duplicate rows removed.

Example: removeDuplicates([col1, col3])

    col1 col2 col3
    ---- ---- ----
r1: 20   25   30
r2: 20   70   30
r3: 40   70   30

returns

    col1 col2 col3
    ---- ---- ----
r1: 20   25   30
r3: 40   70   30

Doing this with Python pandas is easy: for a DataFrame, you call drop_duplicates(subset=[field1, field2, ...]). However, no such removeDuplicates operation is defined in BigQuery SQL.

My best guess at how to do it in BigQuery is to use the RANK() function:

https://cloud.google.com/bigquery/query-reference#rank
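
A sketch of what I have in mind, using ROW_NUMBER() instead of RANK() so that exact duplicates don't tie ([mydataset.mytable] is a placeholder table name):

    SELECT col1, col2, col3
    FROM (
      SELECT
        col1, col2, col3,
        -- number the rows within each (col1, col3) group
        ROW_NUMBER() OVER (PARTITION BY col1, col3) AS dup_rank
      FROM [mydataset.mytable]
    )
    WHERE dup_rank = 1  -- keep only the first row of each group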

I am looking for a concise solution if one exists.

1 Answer

5 votes

You can GROUP BY all of the columns you want to deduplicate on, and take FIRST() of each of the others. That is, removeDuplicates([col1, col3]) would translate to

SELECT col1, FIRST(col2) as col2, col3 
FROM table 
GROUP EACH BY col1, col3

Note that in BigQuery SQL, if you have more than about a million distinct (col1, col3) combinations, you'll need the EACH keyword.
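
FIRST() and GROUP EACH BY belong to BigQuery's legacy SQL dialect. On the newer standard SQL dialect, a rough equivalent of the same grouping trick (assuming a table named mydataset.mytable) is ANY_VALUE():

    SELECT
      col1,
      ANY_VALUE(col2) AS col2,  -- one col2 value per (col1, col3) group
      col3
    FROM mydataset.mytable
    GROUP BY col1, col3

Like FIRST(), ANY_VALUE() makes no promise about which row's value is kept; if "first" must mean a specific order, a ROW_NUMBER() window with an explicit ORDER BY is the safer route.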