Redshift UPDATE prohibitively slow

Question

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:

SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
  AND col2 = 12;

The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets this criteria, so the result set is just one row. However, if I run:

UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
  AND col2 = 12;

This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.

I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.

Any idea what is going on here?

See if you have an open transaction on that row. Try running SELECT * FROM pg_stat_activity; and see if your update is waiting. — Kuberchaun
You should provide your table and index definitions the version of postgres your on and any output of any commands you run. Do you have a trigger on the table? — Kuberchaun
Also add an explain plan for your select and your update more information is a good thing for these types of things. — Kuberchaun
I don't know anything about redshift, i took a quick look at the documentation, it looks like it is postgres 8.x with deviations from postgres, like unsupported commands, etc. I don't know if you have tried this. Maybe the update is moving the data from one node to another, and that is not working correctly. so, can you insert the data in sometable, and then delete the old data? a billion rows is a lot of rows. dang. — Greg

Joe Harris Joe Harris · Accepted Answer · 2014-09-23T15:16:52

You didn't mention what percentage of the table you're updating but it's important to note that an UPDATE in Redshift is a 2 step process:

Each row that will be changed must be first marked for deletion
Then a new version of the data must be written for each column in the table

If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.

You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.

Redshift UPDATE prohibitively slow

2 Answers