How to avoid skew in Redshift for big tables?

Question

I want to load a table of more than 1 TB from S3 into Redshift.

I cannot use DISTSTYLE ALL because the table is too big.

I cannot use DISTSTYLE EVEN because I want to use this table in joins, and EVEN distribution causes performance issues there.

The columns of my table are:

id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER

Our Redshift cluster has 20 nodes.

So I tried using workday as the distribution key, but the table is badly skewed.

There are 7 unique work days and 24 unique work hours.

How can I avoid skew in such cases?

More generally, how do we avoid skew when the row counts per key value are uneven (say hour 1 has 1 million rows, hour 2 has 1.5 million rows, hour 3 has 2 million rows, and so on)?

We may be able to give better advice if you provide an example query. — Joe Harris
I am using the COPY command to load the data from S3 into Redshift. — RohanB
If the counts really are around 1 million, 1.5 million and 2 million rows per hour, as in your example, there should be no skew. Redshift would even out the distribution by, say, placing 3m on one node and 1m+2m on another node, etc. There has to be a very significant outlier causing the skew. Can you show the counts by day and by hour separately, and the top 20 of their combinations? — AlexYes

demircioglu · Accepted Answer · 2018-12-06T23:02:42

Distribute your table using DISTSTYLE EVEN and define a sort key, either a single-column SORTKEY or a COMPOUND SORTKEY over several columns. The sort key will help your query performance. Try this first.
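As a rough sketch of that first attempt, the DDL could look like the following. The table name work_log is made up for illustration, and the sort key columns are only an assumption that your queries filter on workday and workhour; use whichever columns your queries actually filter or join on.

```sql
-- Hypothetical table name; the columns are the ones listed in the question.
CREATE TABLE work_log (
    id              INTEGER,
    name            VARCHAR(10),
    another_id      INTEGER,
    workday         INTEGER,
    workhour        INTEGER,
    worktime_number INTEGER
)
DISTSTYLE EVEN                         -- rows are spread round-robin across slices
COMPOUND SORTKEY (workday, workhour);  -- assumed filter columns, adjust to your queries
```

With EVEN distribution the row counts per slice stay balanced whatever the data looks like. The trade-off is that joins may need to redistribute rows at query time.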

DISTSTYLE/DISTKEY determines how your data is distributed across the cluster. Among the columns used in your queries, choose as the DISTKEY the one that causes the least skew. A column with many distinct values, such as a timestamp, is a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
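This is why workday skews so badly: all rows with the same DISTKEY value land on the same slice, so 7 distinct values can occupy at most 7 slices, no matter how many slices a 20-node cluster has. One way to compare candidate columns, assuming the data is already loaded into a table (called work_log here purely for illustration), is to count the distinct values in each:

```sql
-- Hypothetical table name; compares cardinality of possible DISTKEY columns.
-- A good candidate has far more distinct values than the cluster has slices.
SELECT
    COUNT(*)                   AS total_rows,
    COUNT(DISTINCT workday)    AS workday_values,     -- 7 according to the question
    COUNT(DISTINCT workhour)   AS workhour_values,    -- 24 according to the question
    COUNT(DISTINCT another_id) AS another_id_values,
    COUNT(DISTINCT id)         AS id_values
FROM work_log;
```

Exact distinct counts can be slow on a table of this size. Redshift also offers APPROXIMATE COUNT(DISTINCT ...), which may be cheaper if a rough figure is enough.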

You might need to recreate your table with different DISTKEY / SORTKEY combinations and test which one works best for your typical queries.
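A minimal sketch of one such experiment is below. The table names are hypothetical, and another_id as the DISTKEY is only an assumption that it is your join column and has many distinct values. The SVV_TABLE_INFO system view reports skew_rows, the ratio of rows on the fullest slice to rows on the emptiest one, so a value close to 1 means the table is well balanced.

```sql
-- Hypothetical candidate layout: copy the data with a different DISTKEY.
CREATE TABLE work_log_by_another_id
DISTKEY (another_id)                   -- assumed join column, not confirmed by the question
COMPOUND SORTKEY (workday, workhour)   -- assumed filter columns
AS
SELECT * FROM work_log;

-- Compare row skew between the original and the candidate table.
SELECT "table", diststyle, sortkey1, tbl_rows, skew_rows
FROM svv_table_info
WHERE "table" IN ('work_log', 'work_log_by_another_id');
```

Running your typical join queries against each variant and comparing their run times alongside skew_rows should show which combination is worth keeping.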

For more info, see https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
