1 vote

I created two tables with 43,547,563 rows each:

CREATE TABLE metrics_compressed (
    some_id bigint ENCODE ZSTD,
    some_value varchar(200) ENCODE ZSTD distkey,
    ...,
    some_timestamp bigint ENCODE ZSTD,
    ...,
    PRIMARY KEY (some_id, some_timestamp, some_value)
)
sortkey (some_id, some_timestamp);

The second one is exactly like the first one but without any column compressed.

Running this query (it just counts one row):

select count(*)
from metrics_compressed
where some_id = 20906 
and some_timestamp = 1475679898584;

shows a table scan of 42,394,071 rows (from the rows_pre_filter column in svl_query_summary, with is_rrscan true), while the same query over the uncompressed table scans only 3,143,856 rows. I guess the reason for this is that the compressed table uses fewer 1 MB blocks, each holding more rows, so the scan reports the total number of rows in the blocks it retrieved.
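For reference, the numbers above can be pulled from svl_query_summary like this (a sketch; the query ID is hypothetical and would come from SVL_QLOG):

```sql
-- Compare rows scanned vs. rows emitted for the scan steps of one
-- query. rows_pre_filter counts every row in the blocks that were
-- read; is_rrscan marks a range-restricted (zone-map) scan.
SELECT query, seg, step, rows, rows_pre_filter, is_rrscan
FROM svl_query_summary
WHERE query = 123456        -- hypothetical query ID
  AND is_rrscan = 't'
ORDER BY seg, step;
```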

Are the scanned rows a sign of bad performance? Or does Redshift use some kind of binary search within a block for simple queries like this one, so that the scanned-rows figure is just misleading information when optimizing queries?

The first sign of bad performance is the execution time of the queries: how do they differ, if at all? In general it is not recommended to compress sort keys, probably due to the problem you guessed about (more rows in a block, longer time to search within a block). - AlexYes
The query over the compressed table takes twice the time. We have some complex queries taking much longer, and they're also retrieving ~3M rows just to count a few. We wonder if there's something else to optimize regarding compression or keys, but we need to understand how it handles retrieving so many rows per block. Maybe Redshift is better suited to counting millions of rows rather than just a few. - Federico Pugnali
Then I think the answer is no, Redshift doesn't optimize searches within a block. Since it's columnar storage, scanning a block is probably similar to scanning an array, where the array is the chunk of data for one column in a block. Compressing the sort key is not good practice, since you'll be searching a much larger array. - AlexYes
Can you please provide the output that shows it is scanning a different number of "rows"? Are you referring to the cost calculation in the EXPLAIN output? More information please! - John Rotenstein
I got that data from the svl_query_summary table for that query. I updated the question. - Federico Pugnali

1 Answer

0 votes

In general, you should let Amazon Redshift determine its own compression types. It does this by loading 100,000 rows, determining the optimal compression type for each column based on that sample, then dropping those rows and restarting the load. This happens automatically during the first load into a table if no compression type is specified on the columns.
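You can also ask Redshift for its recommendations explicitly on a table that already has data. A sketch, using the table name from the question:

```sql
-- Report the suggested encoding per column, plus the estimated
-- size reduction, based on a sample of the table's rows.
-- (Read-only: does not change the table.)
ANALYZE COMPRESSION metrics_compressed;

-- During a load, COPY applies automatic compression only when the
-- target table is empty and its columns have no encodings set;
-- this is controlled by the COMPUPDATE option (ON by default).
```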

The SORTKEY is more important for fast queries than compression, because it allows Redshift to skip entirely any blocks that do not contain the desired data. In your example, filtering on some_id in the WHERE clause lets it read only the blocks containing that value, and since some_id is also the leading SORTKEY column, this is extremely efficient.

Once a block is identified as potentially containing rows matching the SORTKEY predicate, Redshift reads the block from disk and processes its contents.
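That block-skipping works via per-block zone maps (stored min/max values), which you can inspect through the STV_BLOCKLIST system table. A sketch, assuming the table name from the question and that some_id is column ordinal 0 (check the actual ordinal against your DDL):

```sql
-- Per-block min/max for the leading sort key column. Blocks whose
-- [minvalue, maxvalue] range excludes the predicate value are
-- skipped without being read from disk.
SELECT b.slice, b.blocknum, b.num_values, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p ON b.tbl = p.id AND b.slice = p.slice
WHERE p.name = 'metrics_compressed'
  AND b.col = 0                -- assumed ordinal of some_id
ORDER BY b.slice, b.blocknum;
```

For a bigint column like some_id the min/max values are stored directly, so you can see exactly which blocks a given predicate value falls into.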

The general rule is to use DISTKEY for columns most used in JOIN and use SORTKEY for columns most used in WHERE statements (but there are also more subtle variations on those general rules).
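Applied to the example table, that rule would suggest DDL along these lines (a sketch: whether some_value is really your most-joined column depends on your workload, and leaving the sort key columns uncompressed follows the advice in the comments above):

```sql
-- Distribute on the column most used in JOINs; sort on the columns
-- most used in WHERE predicates; leave sort key columns RAW
-- (uncompressed) so range-restricted scans stay cheap.
CREATE TABLE metrics_tuned (
    some_id        bigint       ENCODE RAW,
    some_value     varchar(200) ENCODE ZSTD,
    some_timestamp bigint       ENCODE RAW,
    PRIMARY KEY (some_id, some_timestamp, some_value)
)
DISTKEY (some_value)
SORTKEY (some_id, some_timestamp);
```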