6
votes

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:

SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
  AND col2 = 12;

The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets this criteria, so the result set is just one row. However, if I run:

UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
  AND col2 = 12;

This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.

I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.

Any idea what is going on here?

2
See if you have an open transaction on that row. Try running SELECT * FROM pg_stat_activity; and see if your update is waiting. - Kuberchaun
@Bob - just checked, doesn't look like anything is waiting. - user37760
You should provide your table and index definitions the version of postgres your on and any output of any commands you run. Do you have a trigger on the table? - Kuberchaun
Also add an explain plan for your select and your update more information is a good thing for these types of things. - Kuberchaun
I don't know anything about redshift, i took a quick look at the documentation, it looks like it is postgres 8.x with deviations from postgres, like unsupported commands, etc. I don't know if you have tried this. Maybe the update is moving the data from one node to another, and that is not working correctly. so, can you insert the data in sometable, and then delete the old data? a billion rows is a lot of rows. dang. - Greg

2 Answers

12
votes

You didn't mention what percentage of the table you're updating but it's important to note that an UPDATE in Redshift is a 2 step process:

  1. Each row that will be changed must be first marked for deletion
  2. Then a new version of the data must be written for each column in the table

If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.

You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.

0
votes

Actually I don't think RedShift is designed for bulk updates, RedShift is designed for OLAP instead of OLTP, update operations are inefficient on RedShift by nature.

In this use case, I would suggest to do INSERT instead of UPDATE, while add another column of the TIMESTAMP, and when you do analysis on RedShift, you'll need extra logic to get the latest TIMESTAMP to eliminate possible duplicated data entries.