TL;DR: In Athena, selecting the top 10 rows scans more data for the Parquet format than for the CSV format. Shouldn't it be the other way round?
I am using Athena (V1) to query the following two datasets (the same data in two different file formats):
Format | Size | Athena DB name | Athena table name | dataset description |
---|---|---|---|---|
CSV | 91.3 MB | nycitytaxi | data | NYC taxi trip data, present in a public S3 bucket |
Parquet | 19.4 MB | nycitytaxi | aws_glue_result_xxxx | same data as above, converted to Parquet (through a Glue Crawler job) and stored in one of my S3 buckets |
Now I am executing the following query on both tables:

    select lpep_pickup_datetime, lpep_dropoff_datetime
    from nycitytaxi.<table_name>
    limit 10

When I execute this query on the CSV-based table (table_name: data), the Athena console shows that it scanned 721.96 KB of data.
When I execute this query on the Parquet-based table (table_name: aws_glue_result_xxxx), the Athena console shows that it scanned 10.9 MB of data.
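
To put those numbers in perspective, here is a rough back-of-the-envelope calculation of how much of each file was scanned. It only uses the sizes quoted above; the variable names are mine, and I am assuming 1 MB = 1024 KB (the console may round differently).

```python
# Sizes and scan figures as reported in this question.
# Assumption: 1 MB = 1024 KB.
KB_PER_MB = 1024

csv_size_kb = 91.3 * KB_PER_MB         # CSV file size
csv_scanned_kb = 721.96                # scanned for the CSV table

parquet_size_kb = 19.4 * KB_PER_MB     # Parquet file size
parquet_scanned_kb = 10.9 * KB_PER_MB  # scanned for the Parquet table

# Share of each file that Athena reported as scanned.
csv_pct = 100 * csv_scanned_kb / csv_size_kb
parquet_pct = 100 * parquet_scanned_kb / parquet_size_kb

# How many times more bytes the Parquet query scanned.
ratio = parquet_scanned_kb / csv_scanned_kb

print(f"CSV:     {csv_pct:.2f}% of the file scanned")
print(f"Parquet: {parquet_pct:.2f}% of the file scanned")
print(f"Parquet query scanned {ratio:.1f}x more data than the CSV query")
```

So the CSV query touched under 1% of its file, while the Parquet query read more than half of a file that is itself much smaller.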
Shouldn't Athena be scanning far less data for the Parquet-based table, since Parquet is a columnar format, as opposed to the row-based storage of CSV?