Redshift: Is using a foreign key necessary to take advantage of distribution keys?

Question

In Amazon's guide, they mention specifying PRIMARY and FOREIGN KEYs for all of your tables, and then designating distribution keys where it makes sense, like on columns that often get used to join tables together. I understand that even with a single table query, the right DISTKEY specification would help in doing GROUP BY, but for JOINing two or more tables, do the DISTKEY columns have to be specified as FOREIGN KEYs as well? Or will Redshift co-locate rows from different tables to the same nodes based on the data-type (and maybe name) of columns used as the DISTKEY?

The reason I'm asking is because I'm not really using dimension tables in my application. I could create them simply to use as a foreign key reference to help with the distribution, but then the dimensions tables would have to be maintained.

Consider the following example where I have two tables that are frequently joined:

CREATE TABLE motorcycles
(
  id INT,
  hexcolor CHAR(6)
);

CREATE TABLE helmets
(
  id INT,
  hexcolor CHAR(6)
);

Now suppose in my application, we frequently join the motorcycles table to the helmets table on the hexcolor column. Then it would make sense to use DISTSTYLE KEY and use DISTKEY (hexcolor), right? However, you can't really say that the hexcolor column from the motorcycles table is a foreign key to the helmets table or vice-versa. I could create a dimension table that just had a list of all the possible hexcolor values, and then both the motorcycles and helmets tables could have a foreign key to this dimension table, but it would be a pain to have to maintain this dimension table (Amazon's guide also warns against specifying primary or foreign keys that are not properly maintained, because it will confuse the query planner).

So, with my motorcycles and helmets example, would a foreign key to a dimension table be necessary? Or will Redshift make an assumption that it should distribute the rows for both of these tables the same way based on the fact that the data type of the column used as the distribution key is the same?

Just adding an unrelated 2¢ - you should convert hexcolor to a 24/32-bit integer and store that. It'll be faster and take up less disk space. — ZiggyTheHamster
My real scenario has nothing to do with colors or motorcycles. I was just trying to give an example that would make sense to most anyone. :) — olanmills

Jonathan Jacobson Jonathan Jacobson · Accepted Answer · 2015-11-04T10:04:24

As long as the columns have the same data type, you should expect Redshift to distribute the motorcycles and helmets tables in the same fashion.

There is no justification for a foreign-key in your case. The query planner will be able to take advantage of the fact that the tables are distributed by the same key.

But it's always good to read the execution plan and make sure that it says DS_DIST_NONE - which means that no data redistribution was needed.

Redshift: Is using a foreign key necessary to take advantage of distribution keys?

1 Answers