0
votes

I have two tables in Redshift - t1 and t2.

t2 already contains ~300 000 000 records.

t1 contains ~10 000 000 records.

I need to delete all records from t1 which are already present in t2 based on id field.

In order to do this, I'm going to execute the following queries(one of them):

DELETE FROM t1 WHERE id IN(SELECT id FROM t2);

or

DELETE FROM t1 USING t2 WHERE t1.id = t2.ud;

or

DELETE FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE  t1.id = t2.id);

Before I'll do it on real data, I'd like to ask - is it a good idea to use such queries in Redshift from performance point of view or there are some other (better) techicues there for such case?

1

1 Answers

1
votes

The biggest performance factor to doing this will be the distribution of t1 and t2, not the query construct. Are these tables key based distribution of the column id?

That being said I'd recommend that you go with the "delete using" construct. All of these queries join t1 and t2 together and doing this on the native tables is best. The Redshift query planner should see through the sub-selects in the other 2 queries but why run the risk of it missing an optimization.

Remember that when this complete a vacuum (delete only) / analyze would be recommended (at the end of all the transforms being run). Since both of these actions are now automated (if enabled) you should be covered but if you are adding data into t1 after the delete you will want a full vacuum at an appropriate point.