0 votes

I have a varchar column of max size 20,000 in my Redshift table. About 60% of the rows will have this column null or empty. What is the performance impact in such cases? From this documentation I read:

Because Amazon Redshift compresses column data very effectively, creating columns much larger than necessary has minimal impact on the size of data tables. During processing for complex queries, however, intermediate query results might need to be stored in temporary tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory and temporary disk space, which can affect query performance.

So this means query performance might suffer in this case. Are there any other disadvantages apart from this?
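For illustration only, a query along these lines (the orders table and its notes varchar(20000) column are made up) seems to be the kind of case the documentation warns about, since the wide column gets carried into the intermediate result:

    -- Hypothetical: the wide, mostly-null column is dragged through
    -- a join and aggregation, so the uncompressed intermediate rows
    -- can reserve room for its full declared width.
    SELECT o.customer_id,
           o.notes,                -- varchar(20000), ~60% null/empty
           COUNT(*) AS order_count
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY o.customer_id, o.notes;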

1
What queries do you run that include that column? (Please update your question with examples.) – Jon Scott

1 Answer

2 votes

For storage in a Redshift table, there is no significant degradation, as the documentation suggests: compression encodings keep the data compact.
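To confirm the column is actually being compressed, you can check the encoding assigned to each column and let Redshift suggest encodings from a sample of the data (my_table is a placeholder name):

    -- Current encoding of each column in the table
    SELECT "column", type, encoding
    FROM pg_table_def
    WHERE tablename = 'my_table';

    -- Ask Redshift to recommend encodings based on a data sample
    ANALYZE COMPRESSION my_table;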

However, when you query the column, for instance by using it in a WHERE clause, the null values require extra processing, and this might impact the performance of your query. So the impact depends on your query.
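For example, a filter like the one below (my_table and long_text are placeholder names) has to evaluate the predicate against the wide column for every candidate row, which is where the extra processing comes from:

    -- The predicate on the wide, mostly-null column is evaluated
    -- for each row that passes the scan
    SELECT id, created_at
    FROM my_table
    WHERE long_text IS NOT NULL
      AND long_text LIKE '%refund%';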

EDIT (answer to your comment) - Redshift stores each column in 1 MB "blocks", and these blocks are sorted according to the sort key you specify. Redshift keeps a record of the min/max values of each block and can skip over any blocks that could not contain data to be returned. Query the disk usage for that particular column and compare its size against the other columns.
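One way to do that check, sketched here with my_table as a placeholder, is the usual block-count query against the system tables; each block is 1 MB, so the count per column is its size in MB:

    -- Blocks (1 MB each) used by every column of the table;
    -- col is the zero-based column position in the table definition
    SELECT b.col, COUNT(*) AS mb
    FROM stv_blocklist b
    JOIN stv_tbl_perm p
      ON b.tbl = p.id
     AND b.slice = p.slice
    WHERE p.name = 'my_table'
    GROUP BY b.col
    ORDER BY b.col;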

If I’ve made a bad assumption, please comment and I’ll refocus my answer.