Athena: Minimize data scanned by query including JOIN operation

Question

Suppose there is an external table in Athena that points to a large amount of data stored in Parquet format on S3. It has many columns and is partitioned on a field called 'timeid'. There is also a second, small external table that maps timeid to date.

When the smaller table is also partitioned on timeid, and the two tables are joined on the partition column (timeid) with a condition on date in the WHERE clause, only the partitions of the large table whose timeids correspond to that date are scanned. The full data set is not scanned.

However, if the smaller table is not partitioned on timeid, a full data scan takes place even when there is a condition on the date column.

Is there a way to avoid a full data scan even when the large partitioned table is joined with an unpartitioned small table? This is needed because the small table contains only one record per timeid, and creating a separate file for each record is not practical.

John Rotenstein · Accepted Answer · 2017-08-03T02:01:40

That's an interesting discovery!

You might be able to avoid the large scan by using a sub-query instead of a join.

Instead of:

SELECT ...
FROM large_table
JOIN small_table ON large_table.timeid = small_table.timeid
WHERE small_table.date > '2017-08-03'

you might be able to use:

SELECT ...
FROM large_table
WHERE large_table.timeid IN
         (SELECT timeid FROM small_table
          WHERE date > '2017-08-03')

I haven't tested it, but that would avoid the JOIN you mention.
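
If the sub-query form still scans the whole table, one fallback (a different technique from the sub-query above, and also untested here) is to run two queries: first fetch the matching timeids from the small table, then pass them to the large table as literal values so the filter on the partition column is known when the query is planned. Below is a minimal Python sketch of the second step. The function name `build_pruned_query` and the table names `large_table` and `small_table` are hypothetical, and it assumes timeid is an integer column.

```python
from typing import Iterable


def build_pruned_query(timeids: Iterable[int], table: str = "large_table") -> str:
    """Return a SELECT that filters the partition column on literal values.

    Assumes timeid is an integer partition column (an assumption, not
    something stated in the question).
    """
    # int() rejects non-numeric input; the set removes duplicate ids.
    ids = sorted({int(t) for t in timeids})
    if not ids:
        raise ValueError("no timeids matched the date filter")
    id_list = ", ".join(str(t) for t in ids)
    return f"SELECT * FROM {table} WHERE timeid IN ({id_list})"


# Step 1 (run separately against the small table):
#   SELECT timeid FROM small_table WHERE date > '2017-08-03'
# Step 2: feed the returned timeids into the large-table query.
print(build_pruned_query([20170805, 20170804, 20170804]))
```

The generated statement can then be submitted to Athena in whatever way you already run queries. The "Data scanned" figure reported for the query should show whether only the matching partitions were read.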
