1
votes

I wanted to load the table which is having a table size of more than 1 TB size from S3 to Redshift.

I cannot use DISTSTYLE as ALL because it is a big table.

I cannot use DISTSTYLE as EVEN because I want to use this table in joins which are making performance issue.

Columns on my table are

id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER

Our redshift cluster has 20 nodes.

So, I tried distribution key on a workday but the table is badly skewed.

There are 7 unique work days and 24 unique work hours.

How to avoid the skew in such cases?

How we avoid skewing of the table in case of an uneven number of row counts for the unique key (let's say hour1 have 1million rows, hour2 have 1.5million rows, hour3 have 2million rows, and so on)?

3
We may be able to give better advice if you provide an example query.Joe Harris
I am using copy command to load the data from S3 to redshift.RohanB
"let's say hour1 have 1million rows, hour2 have 1.5million rows, hour3 have 2million rows, and so on" - If it's as you told there should be no skew. Redshift would make the distribution even by let's say placing 3m on one node and 1m+2m on another node, etc. There has to be a very significant outlier that causes the skew. Can you show the count by day and hour separately and top 20 of their combinations?AlexYes

3 Answers

2
votes

Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.

DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised choose a column that causes the least amount of skew as the DISTKEY. A column which has many distinct values, such as timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types, or days of week.

You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.

For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html

1
votes

Here is the architecture that I recommend

1) load to a staging table with dist even and sort by something that is sorted on your loaded s3 data - this means you will not have to vacuum the staging table

2) set up a production table with the sort / dist you need for your queries. after each copy from s3, load that new data into the production table and vacuum.

3) you may wish to have 2 mirror production tables and flip flop between them using a late binding view.

its a bit complex to do this you need may need some professional help. There may be specifics to your use case.

0
votes

As of writing this(Just after Re-invent 2018), Redshift has Automatic Distribution available, which is a good starter.

The following utilities will come in handy:

https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts

As indicated in Answers POSTED earlier try a few combinations by replicating the same table with different DIST keys ,if you don't like what Automatic DIST is doing. After the tables are created run the admin utility from the git repos (preferably create a view on the SQL script in the Redshift DB).

Also, if you have good clarity on query usage pattern then you can use the following queries to check how well the sort key are performing using the below SQLs.

/**Queries on tables that are not utilizing SORT KEYs**/

SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;

/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name, 
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;

of course , there is always room for improvement in the SQLs above, depending on specific stats that you may want to look at or drill down to.

Hope this helps.