I have a 10 GB gzip-compressed file in S3 that I need to process with Spark on EMR. I need to load it, do a full outer join against a second dataset (the target dataset, which I plan to store as Parquet), and write the result back to S3.
I can't make the input file splittable beforehand (it comes from a third party); the most I can do is change the compression to bz2.
Any suggestions on how to make processing the input file as efficient as possible? Currently, just using spark.read.csv takes a very long time and runs as a single task, so the work isn't distributed.
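
For reference, this is roughly what I'm running now (the bucket paths and the join key `id` are placeholders, not my real names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-full-outer-join").getOrCreate()

# Gzip is not splittable, so this read ends up as a single task
# regardless of how many executors the cluster has.
incoming = spark.read.csv(
    "s3://my-bucket/incoming/data.csv.gz",  # placeholder path
    header=True,
    inferSchema=True,
)

# The target dataset, stored as Parquet.
target = spark.read.parquet("s3://my-bucket/target/")  # placeholder path

# Full outer join on an assumed key column.
joined = incoming.join(target, on="id", how="full_outer")

joined.write.mode("overwrite").parquet("s3://my-bucket/output/")  # placeholder path
```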