3
votes

I am testing a large data set (1.5TB, 5.5b records) in athena in both parquet and orc formats. My first test is a simple one, a count query-

SELECT COUNT(*) FROM events_orc
SELECT COUNT(*) FROM events_parquet

The parquet file takes half to run this query as the orc file. But one thing I noticed is that when running a count on a parquet file, it return 0kb as the bytes scanned, where with the orc, it returns 78gb. This makes sense for the parquet because the count is in the meta, no need to scan bytes. The orc also has a meta with the count, but it doesn't seem to be using that meta to determine the counts of these files.

Why doesn't Athena use the metadata in the orc file to determine the count, where it clearly does with parquet files?

2
is your data compressed in case of ORC ?Prabhakar Reddy

2 Answers

1
votes

The answer is as you say that Athena reads the Parquet metadata, but not the ORC. There is no reason besides that feature not being in the version of Presto and/or ORC serde that Athena uses.

I've also noticed that Athena reads too much data when using ORC, it doesn't skip columns it should, etc. I think the Athena ORC serde is just old and doesn't have all the optimisations you would expect. Athena is after all based on a very old Presto version.

0
votes

This depends on how those ORC files were created. Could you explain a bit how did you ETL in the data and what are the table definitions?

There are few indexes that ORC has:

Indexes ORC provides three level of indexes within each file:

file level:

statistics about the values in each column across the entire file

stripe level:

statistics about the values in each column for each stripe

row-level:

statistics about the values in each column for each set of 10,000 rows within a stripe The file and stripe level column statistics are in the file footer so that they are easy to access to determine if the rest of the file needs to be read at all. Row level indexes include both the column statistics for each row group and the position for seeking to the start of the row group.

Athena just like PrestoDb the query engine used by Athena can use these indexes to speed up queries.

I would be extremely surprised if Athena would not be using these bits of information for the queries.