select distinct performance is not consistent

Question

I have a select distinct query on a single table:

select distinct d, e, f, a, b, c from t where a = 1 and e = 2;

Cols a, b, c have a high number of distinct values (high cardinality), while cols d, e, f are low-cardinality columns. My data is in ORC format in S3, and I have external tables in Athena and Redshift Spectrum pointing to the same file.

When the above query is run in Athena, it comes back in a couple of seconds, whereas in Redshift Spectrum it takes a couple of minutes.

But when I move col f to the end of the select list, the query runs fast in Redshift Spectrum too. This happens only for this particular column: moving d or e to the end makes no difference, i.e. the query still runs long. Col f is a varchar column, as are the others, and its max length is 30 bytes.
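For reference, the two variants being compared would look roughly like this. This is only a sketch of the reordering described above: the exact position of the other columns in the fast variant is an assumption, and the table and predicates are unchanged.

```sql
-- Slow in Redshift Spectrum: f sits in the middle of the select list (the original query).
select distinct d, e, f, a, b, c from t where a = 1 and e = 2;

-- Fast in Redshift Spectrum: f moved to the end of the select list.
-- (Assumed ordering; only the position of f is known to matter.)
select distinct d, e, a, b, c, f from t where a = 1 and e = 2;
```

Both statements return the same set of rows; only the order of the columns in the output differs.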

Two questions

(a) Any insight or pointers on the peculiar behavior where moving col f to the end of the list makes the query run faster, whereas putting it in the middle makes it slower?
(b) Is there a recommended SQL best practice to list the columns in decreasing order of cardinality in distinct or group by statements? Does it make a difference in execution time if the lower-cardinality columns are put first, or if they are put in a mixed arrangement?
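To make (b) concrete, the orderings in question could be written as follows in group by form, which is equivalent to select distinct over the same columns. This is an illustrative sketch only, using the cardinalities stated above (a, b, c high; d, e, f low); it does not claim that either ordering is faster.

```sql
-- Decreasing cardinality: high-cardinality columns (a, b, c) first.
select a, b, c, d, e, f
from t
where a = 1 and e = 2
group by a, b, c, d, e, f;

-- Increasing cardinality: low-cardinality columns (d, e, f) first.
select d, e, f, a, b, c
from t
where a = 1 and e = 2
group by d, e, f, a, b, c;
```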

Related: cybertec-postgresql.com/en/speeding-up-group-by-in-postgresql (that's for Postgres, but might apply to Redshift as well) — a_horse_with_no_name
Thanks, so this suggests to put the distinct columns first - which is getting faster results. — nmakb

Jon Scott · Accepted Answer · 2019-04-05T07:45:14

Updating your Redshift driver to the latest version can usually bring your Redshift Spectrum speed almost in line with Athena.

https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver

This may not be the cause in your use case but it is definitely worth a try!
