0
votes

I have a ultra wide table with 100s of column in redshift. This table is more like a matrix. One row per user with 1 or 0 in each column depending on certain features of the user. the goal is obtain best response time on select SQL from this table. We are expecting about 160 M total rows in the table and the select will return about 30M rows. Select is usually based in one of the columns being 1.

the table looks like user_id, col1, col2.....col100

for example:

select * from table 1 where col10=1;

Then with this result set the UI will figure intersecting users between multiple columns. Currently the performance is very bad. What distkey or sortkey options that will improve the query performance?

1

1 Answers

1
votes

Any SELECT statement that returns 30 million rows will always give bad performance.

Given that the WHERE statement could be on any of the hundreds of boolean columns, it cannot be optimized by SORTKEY. However, the biggest cause of a slow response would be the need to return millions of rows.

Redshift is excellent if you are doing calculations across millions (or even billions) of rows, such as:

SELECT COUNT(*) FROM table1 WHERE col10 = 1;

This would return a single row.

However, SELECT * would return millions of rows, each with hundreds of columns, so you are facing network latency from the database through to your client. That's a lot of data you're trying to send down.

It would help if you only asked for the column of interest, such as:

SELECT user_id FROM table1 WHERE col10 = 1;

however this would still return a lot of data.

So, if you really need all those rows to come back, then you might need to create some form of caching layer to locally store that data.