We implemented the following ETL process in the cloud: run a query against our local database every hour => save the result as a CSV file and upload it to cloud storage => load the file from cloud storage into a BigQuery table => remove duplicate records using the following query:
SELECT
  * EXCEPT (row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS row_number
  FROM rawData.stock_movement
)
WHERE row_number = 1
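To make clear what the query is meant to do, here is a minimal Python sketch of the same logic: for each id, keep only the row with the newest timestamp. The function name `keep_latest_per_id`, the `quantity` column, and the sample rows are made up for illustration. The real job runs the SQL in BigQuery, not this code.

```python
from typing import Any, Dict, List


def keep_latest_per_id(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep, for each id, only the row with the greatest timestamp."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        current = latest.get(row["id"])
        # Mirrors PARTITION BY id ORDER BY timestamp DESC ... WHERE row_number = 1.
        # On equal timestamps the first row seen is kept; in SQL the choice
        # between tied rows is not defined.
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["id"]] = row
    return list(latest.values())


# Hypothetical sample rows: id 1 was loaded twice.
sample_rows = [
    {"id": 1, "timestamp": "2019-01-01T08:00:00", "quantity": 5},
    {"id": 1, "timestamp": "2019-01-01T09:00:00", "quantity": 7},
    {"id": 2, "timestamp": "2019-01-01T08:30:00", "quantity": 3},
]
deduplicated = keep_latest_per_id(sample_rows)
```

With these sample rows, the older of the two rows for id 1 is dropped and the single row for id 2 is kept.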
Since 8 am this morning (local time in Berlin), removing the duplicate records has been taking much longer than usual, even though the amount of data is not much different from usual: the step normally takes about 10 seconds, whereas this morning it sometimes took half an hour.
Is the performance of removing duplicate records this way not stable?