I have a Redshift table with page hits, like so
CREATE TABLE hits
(
user_id INT,
ts TIMESTAMP,
page VARCHAR(255)
)
SORTKEY(user_id, ts)
DISTKEY(user_id);
Since I'll be running a bunch of window functions over user_id, I thought it would be a good idea to distribute the table by user_id so nodes don't have to exchange data on users before being able to execute the query.
But the users are only ever active for some time and are numbered sequentially. user_id and time are therefore correlated so whenever I run a query that subsets by time (ts) this will lead to skew if Redshift also distributes by user_id sequentially. This would be less of a problem if it distributed by the DISTKEY randomly. My question is: does it?
(I'm new to Redshift so all of this may just be a total misunderstanding of how things work in general. In that case, apologies in advance!)