There is a distinct query on a single table
select distinct d, e, f, a, b, c from t where a = 1 and e = 2;
The number of distinct values in cols a, b, c are high (high column cardinality) and cols d, e, f are low cardinality columns. My data is in ORC format in S3, and I have external table in Athena and Redshift spectrum pointing to the same file.
When above query is run in athena it comes back in couple of secs, whereas in redshift spectrum it takes couple of minutes.
But when I move col f at the end of the select list, it works fine in Redshift spectrum too. This happens for only for this particular column, I mean moving d or e at the end does not make any difference i.e. they run longer. The col f is a varchar column as are others and the max length of this column is 30 bytes.
Two questions
(a) Any insight or pointers to the peculiar behavior where moving col f to the end of the list makes it run faster whereas putting it in between makes it slower
(b) Is there a recommended SQL best practice to list the columns in decreasing order of column cardinality in distinct or group by statements? Does it make difference in the execution times if columns of lower cardinality are put first or if they are put in mixed arrangement?