I'm reading through the AWS Redshift documentation regarding compression types. In the section on BYTEDICT compression it says the following:
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly recommend using a different encoding, such as LZO, for VARCHAR columns.
Assuming that "large VARCHAR columns" means "high cardinality," that recommendation makes sense. However, the last sentence seems to say one shouldn't bother using BYTEDICT with VARCHAR at all. That doesn't make sense to me, though. If you had millions of VARCHAR rows but the cardinality was low (e.g. Canadian provinces), wouldn't BYTEDICT be the best choice?
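For concreteness, this is the kind of table definition I have in mind (the table and column names here are made up purely for illustration):

    CREATE TABLE customer_addresses (
        customer_id BIGINT       ENCODE az64,
        province    VARCHAR(25)  ENCODE bytedict, -- low cardinality: only 13 provinces/territories
        street      VARCHAR(200) ENCODE lzo       -- high cardinality: keep LZO as the docs suggest
    );

As I understand it, with only 13 distinct values BYTEDICT would store each province as a one-byte dictionary index plus a small shared dictionary per block, which seems like it should be far cheaper than LZO-compressing the raw strings.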