0 votes

I have a 10 GB gzip-compressed file in S3 that I need to process with Spark on EMR. I need to load it, do a full outer join against another dataset, and write the result back to S3. The dataset I join against is the target dataset, which I plan to save as Parquet.

I can't have the input file split beforehand (it comes from a third party); the most I can do is ask for the compression to be changed to bz2.

Any suggestions on how to make reading the input file more efficient? Currently, just using spark.read.csv takes a very long time and runs as a single task, so the read can't be distributed.
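Roughly what I'm running today (bucket names, the header option, and the join key are placeholders):

input_df = spark.read.csv("s3://my-bucket/input/file.txt.gz", header=True)  # gzip is not splittable, so this reads as a single task
target_df = spark.read.parquet("s3://my-bucket/target/")
input_df.join(target_df, on="id", how="full_outer").write.parquet("s3://my-bucket/output/")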


2 Answers

1 vote

Make step 1 a single-worker operation: read in the file and write it back out as snappy-compressed Parquet before doing the join. Once it's written that way, you have a format that can be split across tasks for the join.
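A minimal PySpark sketch of that two-step approach (paths, the header option, and the join key are illustrative):

# Step 1: single-task read of the gzip file, rewritten as splittable snappy Parquet
raw = spark.read.csv("s3://my-bucket/input/file.txt.gz", header=True)
raw.write.option("compression", "snappy").parquet("s3://my-bucket/staging/input_parquet/")

# Step 2: the Parquet copy is read in parallel, so the join is distributed
staged = spark.read.parquet("s3://my-bucket/staging/input_parquet/")
target = spark.read.parquet("s3://my-bucket/target/")
staged.join(target, on="id", how="full_outer").write.parquet("s3://my-bucket/output/")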

0 votes

I'd recommend launching an EC2 instance in the same region as the bucket, downloading the 10 GB file, unzipping it, and uploading it back to S3. Using the AWS CLI, this should take only about 15 minutes in total. For example:

aws s3 cp s3://bucket_name/file.txt.gz .
gunzip file.txt.gz
aws s3 cp file.txt s3://bucket_name/
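Once the uncompressed copy is in S3, Spark can split the read across many tasks; a minimal sketch, assuming a header row and the same bucket layout:

df = spark.read.csv("s3://bucket_name/file.txt", header=True)  # plain text is splittable, so this reads in parallel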