Is it ok to define nullable fields as sortkey and distkey? I don't expect to have many nulls, but it may happen. Thanks
2 Answers
While Redshift would work with nullable DIST key, you have to be careful with that.
If many rows have the same value (whether it is NULL or some other value) in the DIST key column, all those rows would end up on the same node. So it is recommended to use the column which provides uniform distribution of values, as the dist key, that is the frequency of any value is similar to any other value.
When column is nullable, often the NULL value is skewed in terms of frequency comparing to other values. This will result in uneven distribution of load between nodes, and in case of x1.large nodes, you may also quickly run out of disk space on that node.
So, I'd recommend picking a column which does not allow NULLs, though obviously it also needs to logically match with pattern of your joins. If you don't have joins, I'd rather recommend using EVEN dist style in that case, to ensure the distribution of load is even.