0
votes

My table is 500gb large with 8+ billion rows, INTERLEAVED SORTED by 4 keys. One of the keys has a big skew 680+. On running a VACUUM REINDEX, its taking very long, about 5 hours for every billion rows.

When i track the vacuum progress it says the following:

SELECT * FROM svv_vacuum_progress;
         table_name          |                                        status                                        | time_remaining_estimate 
-----------------------------+--------------------------------------------------------------------------------------+-------------------------
 my_table_name               | Vacuum my_table_name sort (partition: 1761 remaining rows: 7330776383)               | 0m 0s

I am wondering how long it will be before it finishes as it is not giving any time estimates as well. Its currently processing partition 1761... is it possible to know how many partitions there are in a certain table? Note these seem to be some storage level lower layer partitions within Redshift.

3

3 Answers

1
votes

If you vacuum is running slow you probably don’t have enough space on the cluster. I suggest you double the number of nodes temporarily while you do the vacuum.

You might also want to think about changing how your schema is set up. It’s worth going through this list of redshift tips to see if you can change anything: https://www.dativa.com/optimizing-amazon-redshift-predictive-data-analytics/

1
votes

These days, it is recommended that you should not use Interleaved Sorting.

The sort algorithm places a tremendous load on the VACUUM operation and the benefits of Interleaved Sorts are only applicable for very small use-cases.

I would recommend you change to a compound sort on the fields most commonly used in WHERE clauses.

The most efficient sorts are those involving date fields that are always incrementing. For example, imagine a situation where rows are added to the table with a transaction date. All new rows have a date greater than the previous rows. In this situation, a VACUUM is not actually required because the data is already sorted according to the Date field.

Also, please realise that 500 GB is actually a LOT of data. Doing anything that rearranges that amount of data will take time.

0
votes

The way we recovered back to the previous stage is to drop the table and restore it from the pre vacuum index time from the backup snapshot.