Athena (Hive/Presto) Parquet vs ORC In Count Query

Question

I am testing a large data set (1.5 TB, 5.5 billion records) in Athena in both Parquet and ORC formats. My first test is a simple one, a count query:

SELECT COUNT(*) FROM events_orc
SELECT COUNT(*) FROM events_parquet

The Parquet query takes half as long to run as the ORC query. One thing I noticed is that when running a count on the Parquet table, Athena reports 0 KB as the bytes scanned, whereas with ORC it reports 78 GB. This makes sense for Parquet because the count is in the metadata, so there is no need to scan any bytes. ORC also has metadata with the count, but Athena doesn't seem to be using it to determine the counts of these files.

Why doesn't Athena use the metadata in the ORC files to determine the count, when it clearly does with Parquet files?

Theo · Accepted Answer · 2020-09-12T07:18:19

As you say, Athena reads the row count from the Parquet metadata but not from the ORC metadata. There is no reason for this other than that the feature is missing from the version of Presto and/or the ORC serde that Athena uses.

I've also noticed that Athena reads too much data when using ORC: it doesn't skip columns it should, etc. I think the Athena ORC serde is just old and doesn't have all the optimisations you would expect. Athena is, after all, based on a very old Presto version.
