0
votes

I'm looking at the performance of some queries that I'm doing in Redshift and noticed something that I can't quite find in the documentation.

I created two tables that have a join key between them (about 10K rows in the child table).

For the parent table, let's call it A, I have a primary key that I've declared to be the distkey and sort key for the table. Let's call this id.

For the child table B, I've made a foreign key field, parent_id that references A.id. parent_id has been declared as the distkey for table B. Table B also has a primary key, id that I've defined. I've created an interleaved sort key on table B for (parent_id,id).

When I try to do an explain joining the two tables, I will always get a Hash Join. If I recreate table B with a normal compound sort key, I will always get a Merge Join.

When I look at the stats of the tables, I don't see any skews that are out of line.

My question is, will Redshift always use Hash Joins with interleaved sort keys or is there something I'm doing wrong?

EDIT - The order of the interleaved sort keys in Table B is actually (parent_id, id). I wrote it above incorrectly. I've updated the above to be clear now.

I suspect the order of your interleaved key is causing the problem. Try putting parent_id first. - Joe Harris
Thanks for the response. I had actually declared Table B with parent_id first -- I just didn't write that above. I've made an edit to clarify, since the merge join still doesn't happen. - rchawdry
Hmm, definitely seems like it's a problem with INTERLEAVED then. I've had issues whenever I've tried it in the past (check the Redshift forum), so I've been staying away from it unless I really need it. - Joe Harris
I tried to use interleaved sorting before in a case where it really should have helped, but the performance was an order of magnitude slower than a standard compound sort. The theory is good, but I assume it's not ready for prime time at this point. - systemjack

1 Answer

2
votes

From my understanding:

  • A merge join can be used when both tables are sorted on the join column, which is very efficient -- a bit like closing a zipper, where both sides "fit into" each other.
  • A hash join is less efficient because it has to build a hash table from one table and then probe it with each row of the other table to find matching values.
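
To make the difference concrete, here is a minimal Python sketch of the two strategies. It is only an illustration of the general algorithms, not Redshift's actual implementation; the function names and the sample rows are made up for this example.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) rows that are BOTH sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left side is behind, advance it
        elif lk > rk:
            j += 1          # right side is behind, advance it
        else:
            # Keys match: emit every right row with this key for the current left row.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


def hash_join(left, right):
    """Join two lists of (key, payload) rows in any order."""
    table = {}
    for key, payload in left:           # build phase: hash one input
        table.setdefault(key, []).append(payload)
    out = []
    for key, payload in right:          # probe phase: look up each row of the other
        for left_payload in table.get(key, []):
            out.append((key, left_payload, payload))
    return out


# Hypothetical sample data: A(id, name) and B(parent_id, name).
parents = [(1, "a"), (2, "b"), (3, "c")]
children_sorted = [(1, "x"), (1, "y"), (3, "z")]
children_unsorted = [(3, "z"), (1, "x"), (1, "y")]

merged = merge_join(parents, children_sorted)
hashed = hash_join(parents, children_unsorted)
```

The merge join only walks forward through both inputs, which is why it depends on both sides being sorted on the join column. The hash join makes no assumption about order, at the cost of building and probing the hash table.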

As you pointed out, if the tables are sorted using a normal compound key, then both tables are sorted by the join column.

With an interleaved sort key, however, the rows are not guaranteed to be fully sorted on any single column.

The documentation for Interleaved Keys says:

An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.

However, this does not mean that the data is fully sorted on each column (with a compound sort, it is at least fully sorted on the leading column). Rather, interleaving gives a generally good mix of ordering, so that filters on any of the sort key columns work reasonably well. Since no single column is necessarily fully sorted, a merge join is not possible, hence the hash join.
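
As a rough illustration of why this happens, the sketch below orders a handful of (parent_id, id) pairs two ways: by a plain compound key, and by a toy bit-interleaved (Z-order style) key. The bit-interleaving scheme is an assumption chosen for illustration; Redshift's internal layout is not documented at this level of detail and may differ.

```python
from itertools import product


def interleave_bits(a: int, b: int, bits: int = 2) -> int:
    """Toy Z-order code: alternate the bits of a and b, taking a's bit first."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((a >> i) & 1)
        code = (code << 1) | ((b >> i) & 1)
    return code


def is_sorted(values):
    """True if the sequence is in non-decreasing order."""
    return all(x <= y for x, y in zip(values, values[1:]))


# Hypothetical rows: every (parent_id, id) pair with both values in 0..3.
rows = list(product(range(4), range(4)))

compound = sorted(rows)  # compound sort key (parent_id, id)
interleaved = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

compound_parent_ids = [p for p, _ in compound]
interleaved_parent_ids = [p for p, _ in interleaved]

print(is_sorted(compound_parent_ids))
print(is_sorted(interleaved_parent_ids))
```

In the compound ordering, parent_id never decreases, so a merge join on parent_id could walk it in order. In the interleaved ordering, parent_id jumps back and forth, which is the property that rules out a merge join in this toy model.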

The blog post Quickly Filter Data in Amazon Redshift Using Interleaved Sorting tries to explain how the data is stored when using interleaved sorting.