Redshift - Redesign tables to use DIST and SORT keys (performance issue)

Question

I'm having serious performance problems on Redshift and I've started to rethink my tables structures.

Right now, I'm identifying tables that have most significance on my dashboard. First of all, I run the following query:

SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
  SELECT DISTINCT s.tbl FROM stl_scan s
    JOIN pg_user u ON u.usesysid = s.userid
    WHERE s.type=2 AND u.usename='looker'
  )
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC,
  size DESC;

Based on query result, I could identify a lot of small tables (1-1000 records) that are distributed as EVEN and it could be ALL - this tables are used in a lot of joins instructions.

Beside that, I've identified that 99% of my tables are using EVEN without sort key. I'm not using denormalized tables so I need to run plenty of joins to get data - for what I've read, EVEN is not good for joins because it could be distributed over the network.

I have 3 tables related to Ticket flow: user, ticket and ticket_history. All those tables are EVEN without sort keys and diststyle as EVEN.

For now, I would like to redesign table user: this table is used on join by condition ticket.user_id = user.id and where clauses like user.email = '[email protected]' or user.email like '%@something.com%' or group by user.email.

First thing I'm planning to do is use diststyle as distribution and key as id. Does make sense use a unique value as dist key? I've read plenty of posts about dist keys and still confuse for me.

As sort keys makes sense use email as compound? I've read to avoid columns that grows like dates, timestamps or identities, that's why i'm not using it as interleaved. To avoid that like, I'm planning to create a new column to identify what is email domain.

After that, I'll change small tables to dist ALL and try my queries again.

Am I on right way? Any other tip?

This question could sound stupid but my tech background is only software development, I'm learning about Redshift and reading a lot of documentations.

John Rotenstein John Rotenstein · Accepted Answer · 2019-05-12T09:41:04

The basic rule of thumb is:

Set the DISTKEY to the column that is most used in JOINs
Set the SORTKEY to the column(s) most used in WHEREs

You are correct that small tables can have a distribution of ALL, which would avoid sending data between nodes.

DISTKEY provides the most benefit when tables are join via a common column that has the same DISTKEY in both tables. This means that each row is contained on the same node and no data needs to be sent between nodes (or, more accurately, slices). However, you can only select one DISTKEY, so do it on the column that is most often used for the JOIN.

SORTKEY provides the most benefit when Redshift can skip over blocks of storage. Each block of storage contains data for one column and is marked with a MIN and MAX value. When a table is sorted on a particular column, it minimises the number of disk blocks that contain data for a given column value (since they are all located together, rather than being spread randomly throughout disk storage). Thus, use column(s) that are most frequently used in WHERE statements.

If the user.email wildcard search is slow, you can certainly create a new column with the domain. Or, for even better performance, you could consider creating a separate lookup table with just user_id and domain, having SORTKEY = domain. This will perform the fastest when searching by domain.

A tip from experience: I would advise against using an email address as a user_id because people sometimes want to change email address. It is better to use a unique number for such id columns, with email address as a changeable attribute. (I've seen software systems need major rewrites to fix such an early design decision!)

Redshift - Redesign tables to use DIST and SORT keys (performance issue)

1 Answers