1
votes

I am trying to weed duplicates out of a table based on the column (that is created within the query) "alpha_ssc_key". However when I run this query, it just returns all of the results (instead of eliminating rows where the "alpha_ssc_key" is a duplicate). Any help would be greatly appreciated!

This is being done within BigQuery.

WITH ssc_test_view AS (
  SELECT
    DISTINCT CONCAT(CAST(date AS STRING), ciq_id, CAST(quantity AS STRING), CAST(cost_basis AS STRING),fund,security,class,inv_type,share_type) AS alpha_ssc_key,
    _metadata_created_at AS file_date,
    realized_gain_loss,
    cusip,
    acq_txn_no,
    acquisition_date,
    security,
    company,
    ticker,
    ciq_id,
    class,
    inv_type,
    dis_txn_no,
    quantity,
    categorization,
    transaction_type,
    cost_basis,
    share_type,
    fund,
    net_proceeds,
    unit_cost
  FROM
    `fcm-dw.acquisition_ssc.ssc_log`)
SELECT
  *
FROM
  ssc_test_view
2

2 Answers

1
votes

DISTINCT applied to ALL columns - not just to your key - that is why still all rows are returned. In your case better to group by fields from your compound key, but you need to decide on logic of actual deduping. In below example MIN is used. But you can use any aggregation function that fit in your needs

Try below

#standardSQL
WITH ssc_test_view AS (
  SELECT
    date,
    ciq_id, 
    quantity, 
    cost_basis,
    fund,
    security,
    class,
    inv_type,
    share_type,
    MIN(_metadata_created_at) AS file_date,
    MIN(realized_gain_loss) AS realized_gain_loss,
    MIN(cusip) AS cusip,
    MIN(acq_txn_no) AS acq_txn_no,
    MIN(acquisition_date) AS acquisition_date,
    MIN(company) AS company,
    MIN(ticker) AS ticker,
    MIN(dis_txn_no) AS dis_txn_no,
    MIN(categorization) AS categorization,
    MIN(transaction_type) AS transaction_type,
    MIN(net_proceeds) AS net_proceeds,
    MIN(unit_cost) AS unit_cost,
  FROM `fcm-dw.acquisition_ssc.ssc_log`
  GROUP BY 
    date,
    ciq_id, 
    quantity, 
    cost_basis,
    fund,
    security,
    class,
    inv_type,
    share_type)
SELECT * FROM ssc_test_view
1
votes

Try this:

SELECT stv.*
FROM (SELECT stv.*,
             ROW_NUMBER() OVER (PARTITION BY alpha_ssc_key ORDER BY _metadata_created_at DESC) as seqnum
      FROM ssc_test_view stv
     ) stv
WHERE seqnum = 1;