I'm having serious performance problems on Redshift and I've started to rethink my tables structures.
Right now, I'm identifying tables that have most significance on my dashboard. First of all, I run the following query:
SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
SELECT DISTINCT s.tbl FROM stl_scan s
JOIN pg_user u ON u.usesysid = s.userid
WHERE s.type=2 AND u.usename='looker'
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC,
size DESC;
Based on query result, I could identify a lot of small tables (1-1000 records) that are distributed as EVEN
and it could be ALL
- this tables are used in a lot of joins instructions.
Beside that, I've identified that 99% of my tables are using EVEN
without sort key. I'm not using denormalized tables so I need to run plenty of joins to get data - for what I've read, EVEN
is not good for joins because it could be distributed over the network.
I have 3 tables related to Ticket flow: user, ticket and ticket_history. All those tables are EVEN
without sort keys and diststyle as EVEN
.
For now, I would like to redesign table user
: this table is used on join by condition ticket.user_id = user.id
and where clauses like user.email = '[email protected]'
or user.email like '%@something.com%'
or group by user.email
.
First thing I'm planning to do is use diststyle as distribution and key as id
. Does make sense use a unique value as dist key? I've read plenty of posts about dist keys and still confuse for me.
As sort keys makes sense use email as compound? I've read to avoid columns that grows like dates, timestamps or identities, that's why i'm not using it as interleaved. To avoid that like
, I'm planning to create a new column to identify what is email domain.
After that, I'll change small tables to dist ALL
and try my queries again.
Am I on right way? Any other tip?
This question could sound stupid but my tech background is only software development, I'm learning about Redshift and reading a lot of documentations.