I am testing a large data set (1.5 TB, 5.5 billion records) in Athena in both Parquet and ORC formats. My first test is a simple count query:
SELECT COUNT(*) FROM events_orc
SELECT COUNT(*) FROM events_parquet
The Parquet query takes half as long to run as the ORC query. One thing I noticed is that the count on the Parquet table reports 0 KB as the bytes scanned, whereas the ORC table reports 78 GB. This makes sense for Parquet, because the row count is stored in the file metadata and no data needs to be scanned. ORC files also store a row count in their metadata, but Athena doesn't seem to use it to determine the count for these files.
Why doesn't Athena use the metadata in the ORC files to determine the count, when it clearly does so for Parquet files?