AWS Athena | CSV vs Parquet | size of data scanned

Question

TL;DR: In Athena, selecting 10 rows (limit 10) scans more data for the Parquet format than for the CSV format. Shouldn't it be the other way round?

I am using Athena (V1) to query the following two datasets (the same data in two different file formats):

Format	Size	Athena DB name	Athena table name	Dataset description
CSV	91.3 MB	nycitytaxi	data	NYC taxi trip data, stored in a public S3 bucket
Parquet	19.4 MB	nycitytaxi	aws_glue_result_xxxx	Same data as above, converted to Parquet through a Glue Crawler job and stored in one of my S3 buckets

I run the following query against both tables:

select lpep_pickup_datetime, lpep_dropoff_datetime 
from nycitytaxi.<table_name>
limit 10

When I run this query on the CSV-based table (table_name: data), the Athena console shows that it scanned 721.96 KB of data.

When I run this query on the Parquet-based table (table_name: aws_glue_result_xxxx), the Athena console shows that it scanned 10.9 MB of data.

Shouldn't Athena scan far less data for the Parquet-based table, since Parquet is a columnar format, whereas CSV is row-based?

Sandeep Singh · Accepted Answer · 2021-02-27T13:46:56

It is due to your specific query.

select lpep_pickup_datetime, lpep_dropoff_datetime 
from nycitytaxi.<table_name>
limit 10

In row-based formats like CSV, all data is stored row-wise. So when you ask for any 10 rows, the engine can start reading the CSV file from the beginning, return the first 10 rows, and stop, which results in a very small data scan.

In columnar formats like Parquet, records are stored column-wise. Suppose the data has three columns: id, name and number. All id values are stored together, all name values are stored together, and all number values are stored together. So to return 10 rows from Parquet, the reader has to fetch values for each selected column from a different storage location, and it generally reads a whole block of each column rather than just 10 values. That means more data is scanned for this particular query.
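The explanation above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: the row counts, per-value byte sizes, row-group size and function names are all hypothetical, and it assumes the columnar reader must fetch a whole column chunk for every selected column. It does not reproduce Athena's actual accounting or the numbers in the question.

```python
# Toy model: bytes read for "first 10 rows of 2 columns" under a
# row-oriented layout versus a columnar layout.
# Assumption: the columnar reader fetches whole column chunks and cannot
# stop partway through one. Real engines and file layouts differ.

ROWS = 100_000             # hypothetical total row count
ROW_GROUP_ROWS = 50_000    # hypothetical rows per columnar row group
LIMIT = 10

# Hypothetical average stored size per value, in bytes.
ROW_FORMAT_BYTES = {"id": 8, "pickup": 20, "dropoff": 20}
COLUMNAR_BYTES = {"id": 2, "pickup": 4, "dropoff": 4}  # smaller: encoded and compressed


def row_layout_limit_bytes(limit):
    """Rows are contiguous, so reading can stop after `limit` full rows."""
    return limit * sum(ROW_FORMAT_BYTES.values())


def columnar_layout_limit_bytes(columns, limit):
    """Each selected column costs one whole chunk per row group touched."""
    row_groups_touched = -(-limit // ROW_GROUP_ROWS)  # ceiling division
    per_row_group = sum(COLUMNAR_BYTES[c] * ROW_GROUP_ROWS for c in columns)
    return per_row_group * row_groups_touched


def row_layout_full_scan_bytes():
    """A full scan of a row layout reads every column of every row."""
    return ROWS * sum(ROW_FORMAT_BYTES.values())


def columnar_layout_full_scan_bytes(columns):
    """A full scan of a columnar layout reads only the selected columns."""
    return sum(COLUMNAR_BYTES[c] * ROWS for c in columns)


selected = ["pickup", "dropoff"]
print("limit 10, row layout:     ", row_layout_limit_bytes(LIMIT))
print("limit 10, columnar layout:", columnar_layout_limit_bytes(selected, LIMIT))
print("full scan, row layout:     ", row_layout_full_scan_bytes())
print("full scan, columnar layout:", columnar_layout_full_scan_bytes(selected))
```

In this model the limit 10 query reads far fewer bytes from the row layout, while a query that has to scan every row of the two selected columns reads fewer bytes from the columnar layout. The columnar advantage shows up on queries that scan the whole table, not on a small limit.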
