0 votes

I'm trying to read two partitions from a Hive table using two Spark SQL WITH clauses and a left outer join between them to get the deltas. Each partition has 27 billion records and is about 900 GB in size, stored as 10 files of roughly 90 GB each. The file format is Parquet with Snappy compression.
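For reference, here is a minimal sketch of the query shape (table name, partition values, and the join key are placeholders for the real ones):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partition_deltas")
             .enableHiveSupport()
             .getOrCreate())

    # Compare two partitions of the same Hive table; placeholder names throughout.
    deltas = spark.sql("""
        WITH old_part AS (
            SELECT * FROM my_db.my_table WHERE part_dt = '2020-01-01'
        ),
        new_part AS (
            SELECT * FROM my_db.my_table WHERE part_dt = '2020-01-02'
        )
        SELECT n.*
        FROM new_part n
        LEFT OUTER JOIN old_part o
          ON n.record_id = o.record_id
        WHERE o.record_id IS NULL  -- rows in the new partition with no match in the old one
    """)

    deltas.write.mode("overwrite").parquet("s3://my-bucket/deltas/")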

I'm running the PySpark job on an AWS EMR cluster of 28 r4.16xlarge nodes. I have tried various Spark configurations, but each time the job fails with a Job aborted due to stage failure: most recent failure: Lost task ... java.io.IOException: No space left on device error.

If I'm not wrong, the job is running out of tmp space on the worker nodes. I tried setting spark.sql.shuffle.partitions=3000, but even then it's failing. Any idea how I can fix this?

Spark configurations tried so far:

try:1
    --executor-cores 5 --num-executors 335 --executor-memory 37G --driver-memory 366G

try:2
    '--driver-memory 200G --deploy-mode client --executor-memory 40G --executor-cores 4 ' \
                   '--conf spark.dynamicAllocation.enabled=true ' \
                   '--conf spark.shuffle.service.enabled=true ' \
                   '--conf spark.executor.memoryOverhead=30g ' \
                   '--conf spark.rpc.message.maxSize=1024 ' \
                   '--conf spark.sql.shuffle.partitions=3000 ' \
                   '--conf spark.sql.autoBroadcastJoinThreshold=-1 ' \
                   '--conf spark.driver.maxResultSize=4G ' \
                   '--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2'

try:3
    '--driver-memory 200G --deploy-mode client --executor-memory 100G --executor-cores 4 ' \
                   '--conf spark.dynamicAllocation.enabled=true ' \
                   '--conf spark.shuffle.service.enabled=true '
try adding more size to your EMR EC2 EBS volumes – Prabhakar Reddy

2 Answers

1 vote

In my limited experience with Spark, the cause of this error may be insufficient temporary space for shuffle data. You can try modifying the spark-env.sh configuration:

export SPARK_WORKER_DIR=dir_have_enough_space
export SPARK_LOCAL_DIRS=dir_have_enough_space
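
One caveat, assuming the cluster runs Spark on YARN (as EMR does): YARN supplies the executors' scratch directories itself via yarn.nodemanager.local-dirs and overrides SPARK_LOCAL_DIRS. On EMR those directories already point at the instances' local disks, so the practical fix there is usually to add disk capacity, as the other answer suggests.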
0 votes

I used the solution provided in this article, but the process of allocating the EBS volume might differ depending on your bootstrapping process.
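
For illustration, here is a rough sketch of requesting extra EBS capacity at cluster creation with boto3; the volume sizes, instance counts, release label, and IAM role names are assumptions to adapt, not settings taken from the article:

    import boto3

    emr = boto3.client("emr")  # region/credentials come from your environment

    # Sketch only: every size, count, and name below is illustrative.
    emr.run_job_flow(
        Name="spark-delta-cluster",
        ReleaseLabel="emr-5.29.0",  # assumed EMR release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "r4.16xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "r4.16xlarge",
                    "InstanceCount": 27,
                    # Extra EBS volumes give each node more room for shuffle spill.
                    "EbsConfiguration": {
                        "EbsBlockDeviceConfigs": [
                            {
                                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 500},
                                "VolumesPerInstance": 2,
                            }
                        ]
                    },
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles assumed
        ServiceRole="EMR_DefaultRole",
    )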