
I wonder why unloading from a big table (>100 billion rows), filtering on a column that is NOT the sort key (or part of it), is immensely faster for newly added data. How does Redshift know when it can stop the sequential scan in the second scenario?

Time the query spent executing: 39m 37.02s:

UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\'2017-01-15\' AND \'2017-01-16\'') TO ...

vs.

Time the query spent executing: 23.01s:

UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\'2017-06-24\' AND \'2017-06-25\'') TO ...

Thanks!


1 Answer


Amazon Redshift uses zone maps to track the minimum and maximum value stored in each 1MB block on disk. Each block stores data for a single column only (e.g. daytime).

If the SORTKEY is not set to daytime, the data is effectively unsorted with respect to that column, and any particular date could appear in many different blocks. With daytime as the SORTKEY, a particular date appears in only a minimal number of blocks, and Redshift can skip all the rest.
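You can actually see the zone maps Redshift keeps, via the STV_BLOCKLIST system table. A rough sketch (table name and column ordinal are placeholders for your own; note that minvalue/maxvalue are stored as raw integers, so date columns show up in their internal encoding):

```sql
-- Per-block min/max (zone map) values for one column of some_table.
-- b.col is the zero-based ordinal of the column you filter on.
SELECT b.blocknum, b.minvalue, b.maxvalue, b.sorted
FROM stv_blocklist b
JOIN svv_table_info t ON b.tbl = t.table_id
WHERE t."table" = 'some_table'
  AND b.col = 0          -- ordinal of the daytime column (assumption)
ORDER BY b.blocknum;
```

If a given date range's min/max overlaps many blocks, the scan touches all of them; if it falls inside a handful of blocks, the scan is correspondingly short.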

Your second query probably executes faster, even without a suitable SORTKEY, because it targets recently added data, which is still stored together in just a few blocks. The historical data may be spread across many blocks because a VACUUM has since reordered it according to the table's actual SORTKEY. In fact, if you ran a VACUUM now, you might find that your second query becomes slower.
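One way to check this theory is to look at how much of the table sits in the unsorted region, which is where newly loaded rows accumulate until the next VACUUM. A sketch, assuming the table name from your query:

```sql
-- 'unsorted' is the percentage of rows not yet merged into sort order;
-- a high value suggests recent loads are still clustered together.
SELECT "table", unsorted, vacuum_sort_benefit
FROM svv_table_info
WHERE "table" = 'some_table';
```

A large unsorted percentage would be consistent with your June data living in its own small, recently written set of blocks.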