You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.
Let's say we have a data lake of people with `first_name`, `last_name`, and `country` columns.
If the data is stored as CSV files and you run a query like `peopleDF.select("first_name").distinct().count()`, then S3 transfers the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need any of the `last_name` and `country` data to run this query.
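For reference, here's roughly what that plain CSV read looks like (the path and the header option are just illustrations, not my actual setup):

```scala
// Plain CSV read: S3 streams the bytes for every column of every object
// to the cluster, and Spark discards the unneeded columns only after
// they have already crossed the network.
val peopleDF = spark
  .read
  .option("header", "true")
  .csv("s3://bucket/people/") // hypothetical path

peopleDF.select("first_name").distinct().count()
```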
If the data is stored as CSV files and you run the same query with S3 Select, then S3 only transfers the data in the `first_name` column to the cluster:
```
spark
  .read
  .format("s3select")
  .schema(...)
  .options(...)
  .load("s3://bucket/filename")
  .select("first_name")
  .distinct()
  .count()
```
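For concreteness, the elided schema might look something like the following; the field names come from the example, and the `StringType`s are my assumption. (I believe the EMR docs actually spell the format name `s3selectCSV`, so the exact format string may differ from what I wrote above.)

```scala
import org.apache.spark.sql.types._

// Hypothetical schema for the people data lake; all three columns are
// assumed to be strings. A schema is supplied up front because, as far
// as I can tell, the S3 Select data source doesn't infer one.
val peopleSchema = StructType(Seq(
  StructField("first_name", StringType),
  StructField("last_name", StringType),
  StructField("country", StringType)
))
```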
If the data is stored in a Parquet data lake and `peopleDF.select("first_name").distinct().count()` is run, then S3 only transfers the data in the `first_name` column to the EC2 cluster. Parquet is a columnar file format, and this column pruning is one of its main advantages.
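Here's the Parquet version for comparison; the path is made up, and the `explain()` call is just a way to see the pruning happen:

```scala
// Parquet read: each column's values are stored contiguously, so Spark's
// reader only requests the byte ranges that hold first_name.
val peopleDF = spark.read.parquet("s3://bucket/people_parquet/") // hypothetical path

peopleDF.select("first_name").distinct().count()

// The physical plan confirms the pruning: ReadSchema should list only
// struct<first_name:string>.
peopleDF.select("first_name").distinct().explain()
```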
So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake, because columnar file formats already offer the column-pruning optimization out of the box.

I am not sure, though, because a coworker is certain I am wrong, and because S3 Select does support the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?