Maintenance on Redshift

Question

I find myself dealing with a Redshift cluster with 2 different types of tables: the ones that get fully replaced every day and the ones that receive a merge every day.

From what I understand so far, there are maintenance commands that should be given since all these tables have millions of rows. The 3 commands I've found so far are:

vacuum table_name;
vacuum reindex table_name;
analyze table_name;

Which of those commands should be applied on which circumstance? I'm planning on doing it daily after they load in the middle of the night. The reason to do it daily is because after running some of these manually, there is a huge performance improvement.

After reading the documentation, I feel it's not very clear what the standard procedure should be.

Thanks!

EDIT: All the tables have interleaved sortkeys regardless of the load type.

John Rotenstein John Rotenstein · Accepted Answer · 2017-04-23T22:24:54

A quick summary of the commands, from the VACUUM documentation:

VACUUM: Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. A full vacuum doesn't perform a reindex for interleaved tables.
VACUUM REINDEX: Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation.
ANALYZE: Updates table statistics for use by the query planner.

It is good practice to perform an ANALYZE when significant quantities of data have been loaded into a table. In fact, Amazon Redshift will automatically skip the analysis if less than 10% of data has changed, so there is little harm in running ANALYZE.

You mention that some tables get fully replaced every day. This should be done either by dropping and recreating the table, or by using TRUNCATE. Emptying a table with DELETE * is less efficient and should not be used to empty a table.

VACUUM can take significant time. In situations where data is being appended in time-order and the table's SORTKEY is based on the time, it is not necessary to vacuum the table. This is because the table is effectively sorted already. This, however, does not apply to interleaved sorts.

Interleaved sorts are more tricky. From the sort key documentation:

An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.

Basically, interleaved sorts use a fancy algorithm to sort the data so that queries based on any of the columns (individually or in combination) will minimize the number of data blocks that are required to be read from disk. Disk access always takes the most time in a database, so minimizing disk access is the best way to speed-up the database. Amazon Redshift uses Zone Maps to identify which blocks to read from disk and the best way to minimize such access is to sort data and then skip over as many blocks as possible when performing queries.

Interleaved sorts are less performant than normal sorts, but give the benefit that multiple fields are fairly well sorted. Only use interleaved sorts if you often query on many different fields. The overhead in maintaining an interleaved sort (via VACUUM REINDEX) is quite high and should only be done if the reindex effort is worth the result.

So, in summary:

ANALYZE after significant data changes
VACUUM regularly if you delete data from the table
VACUUM REINDEX if you use Interleaved Sorts and significant amounts of data have changed

Maintenance on Redshift

1 Answers