I am trying to load very large volumes of data into redshift into a single table that will be too cost prohibitive to Vacuum once loaded. To avoid having to vacuum this table, I am loading data using COPY command, from a large number of pre-sorted CSV files. The files I am loading are pre-sorted based on the sort keys defined in the table.
However after loading the first two files, I find that redshift reports the table as ~50% unsorted. I have verified that the files have the data in the correct sort order. Why would redshift not recognize the new incoming data as already sorted? Do I have to do anything special to let the copy command know that this new data is already in the correct sort order?
I am using the SVV_TABLE_INFO table to determine the sort percentage (using the unsorted field). The sort key is a composite key of three different fields (plane, x, y).
Official Answer by Redshift Support:
Here is what we say officially: http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html
When your table has a sort key defined, the table is divided into 2 regions:
- sorted, and
- unsorted
As long as you load data in sorted key order, even though the data is in the unsorted region, it is still in sort key order, so there is no need for VACUUM to ensure the data is sorted. A VACUUM is still needed to move the data from the unsorted region to the sorted region, however this is less critical as the data in the unsorted region is already sorted.
SORTKEYfor the table set to the column that contains the timestamp for each record? Have you examined the Min and Max values associated with the Zone Maps for each block to determine overlap? - John Rotenstein