0 votes

I have a 10 GB gzip-compressed file in S3 that I need to process with Spark on EMR. I need to load it, do a full outer join against another dataset, and write the result back to S3. The dataset I join against is the target dataset, which I plan to save as Parquet.

I can't have the input file split beforehand (it comes from a third party); the most I can do is ask for the compression to be changed to bz2.

Any suggestions on how to make reading the input file more efficient? Currently, just using spark.read.csv takes a very long time and runs as a single task, so the read can't be distributed.
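Roughly what I'm running today (bucket names, the header option, and the join key are placeholders):

input_df = spark.read.csv("s3://my-bucket/input/file.txt.gz", header=True)  # gzip is not splittable, so this reads as a single task
target_df = spark.read.parquet("s3://my-bucket/target/")
input_df.join(target_df, on="id", how="full_outer").write.parquet("s3://my-bucket/output/")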


2 Answers

1 vote

Make step 1 a single-worker operation: read in the file and write it back out as snappy-compressed Parquet before doing the join. Once it's written that way, you have a format that can be split across tasks for the join.
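A minimal PySpark sketch of that two-step approach (paths, the header option, and the join key are illustrative):

# Step 1: single-task read of the gzip file, rewritten as splittable snappy Parquet
raw = spark.read.csv("s3://my-bucket/input/file.txt.gz", header=True)
raw.write.option("compression", "snappy").parquet("s3://my-bucket/staging/input_parquet/")

# Step 2: the Parquet copy is read in parallel, so the join is distributed
staged = spark.read.parquet("s3://my-bucket/staging/input_parquet/")
target = spark.read.parquet("s3://my-bucket/target/")
staged.join(target, on="id", how="full_outer").write.parquet("s3://my-bucket/output/")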

0 votes

I'd recommend launching an EC2 instance in the same region as the bucket, downloading the 10 GB file, unzipping it, and uploading it back to S3. Using the AWS CLI, this should take only about 15 minutes in total. For example:

aws s3 cp s3://bucket_name/file.txt.gz .
gunzip file.txt.gz
aws s3 cp file.txt s3://bucket_name/
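Once the uncompressed copy is in S3, Spark can split the read across many tasks; a minimal sketch, assuming a header row and the same bucket layout:

df = spark.read.csv("s3://bucket_name/file.txt", header=True)  # plain text is splittable, so this reads in parallel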