1
votes

We have customer data that is sharded by a company ID. That is, no companies data would ever mix with another companies data so this was chosen as the distkey.

Should the company ID be the first column in the sortkey given that a node may contain several thousand companies? Or does the distkey already limit the data to a given company before it starts scanning?

1
the distkey should be a performance choice. Using it to put on different shards doesn't do anything. It's transparent. Its not like company X can connect to just shard Y and look at their data. - Nick.McDermaid
I understand that but it doesn't really answer the question either way. If I perform the query: SELECT COUNT(*) FROM sales WHERE company_id = 123 it will know which node to run the query, but then will it need to scan the whole node to find the records (so it should be in the sortkey), or is the data segmented on the node into individual company_id's (sortkey is not needed)? - Elliot Chance

1 Answers

1
votes

Dist key does not affect the order in which rows are stored in each node/slice/block. Sort key (or natural order in the absence of such) defines the order.

If you expect frequent queries with company_id and you want to achieve maximum performance, make company_id the main sort key (COMPOUND or default, not just INTERLEAVED).

I'd also advise familiarising yourself with the SVL_QUERY_REPORT view. It can tell you whether full-scan was used (or range-restricted when using optimal sort keys), against which slices, and how many rows were actually scanned. Try different table layouts for the same data, and not only look at query times, but also confirm from this report that Redshift does what you expect it to do.