In Amazon's guide, they mention specifying PRIMARY and FOREIGN KEYs for all of your tables, and then designating distribution keys where it makes sense, like on columns that often get used to join tables together. I understand that even with a single table query, the right DISTKEY specification would help in doing GROUP BY, but for JOINing two or more tables, do the DISTKEY columns have to be specified as FOREIGN KEYs as well? Or will Redshift co-locate rows from different tables to the same nodes based on the data-type (and maybe name) of columns used as the DISTKEY?
The reason I'm asking is because I'm not really using dimension tables in my application. I could create them simply to use as a foreign key reference to help with the distribution, but then the dimensions tables would have to be maintained.
Consider the following example where I have two tables that are frequently joined:
CREATE TABLE motorcycles
(
id INT,
hexcolor CHAR(6)
);
CREATE TABLE helmets
(
id INT,
hexcolor CHAR(6)
);
Now suppose in my application, we frequently join the motorcycles table to the helmets table on the hexcolor column. Then it would make sense to use DISTSTYLE KEY
and use DISTKEY (hexcolor)
, right? However, you can't really say that the hexcolor column from the motorcycles table is a foreign key to the helmets table or vice-versa. I could create a dimension table that just had a list of all the possible hexcolor values, and then both the motorcycles and helmets tables could have a foreign key to this dimension table, but it would be a pain to have to maintain this dimension table (Amazon's guide also warns against specifying primary or foreign keys that are not properly maintained, because it will confuse the query planner).
So, with my motorcycles and helmets example, would a foreign key to a dimension table be necessary? Or will Redshift make an assumption that it should distribute the rows for both of these tables the same way based on the fact that the data type of the column used as the distribution key is the same?