I'm trying to read two partitions from a Hive table using two Spark SQL WITH clauses and then doing a LEFT OUTER JOIN between them to get the deltas. Each partition has 27 billion records and is about 900 GB, stored as 10 files of 90 GB each. The file format is Parquet with Snappy compression.
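For context, this is roughly what the query looks like (my_db.my_table, the ds partition column, the id key, and the output path are placeholders, not my real names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Two WITH clauses, one per partition, then a left outer join to pick up
# rows in the new partition that have no match in the old one (the deltas).
deltas = spark.sql("""
    WITH old_part AS (
        SELECT * FROM my_db.my_table WHERE ds = '2019-01-01'
    ),
    new_part AS (
        SELECT * FROM my_db.my_table WHERE ds = '2019-01-02'
    )
    SELECT n.*
    FROM new_part n
    LEFT OUTER JOIN old_part o ON n.id = o.id
    WHERE o.id IS NULL
""")
deltas.write.mode("overwrite").parquet("s3://my-bucket/deltas/")  # placeholder path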
I'm running the PySpark job on an AWS EMR cluster of 28 r4.16xlarge nodes. I have tried various Spark configurations, but each time the job fails with this error:

Job aborted due to stage failure: most recent failure: Lost task java.io.IOException: No space left on device
If I'm not wrong, the job is running out of temp space on the worker nodes during the shuffle. I tried setting spark.sql.shuffle.partitions=3000, but even then it's failing. Any idea how I can fix this?
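My back-of-envelope reasoning for why local disk might be filling up (a guess based on the sizes above, not something I've measured):

# Rough shuffle sizing, assuming both sides of the join get shuffled in full
# and the shuffle files on disk stay roughly Snappy-compressed size.
total_shuffle_gb = 2 * 900           # two 900 GB partitions, both shuffled for the join
per_node_gb = total_shuffle_gb / 28  # spread across 28 nodes
print(per_node_gb)                   # ~64 GB of shuffle scratch per node, before any spills

If the shuffle directories sit on a small EBS root volume, ~64 GB of shuffle files per node plus any spills could plausibly exhaust it.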
Spark configurations tried so far:
try:1
--executor-cores 5 --num-executors 335 --executor-memory 37G --driver-memory 366G
try:2
--driver-memory 200G --deploy-mode client --executor-memory 40G --executor-cores 4
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.executor.memoryOverhead=30g
--conf spark.rpc.message.maxSize=1024
--conf spark.sql.shuffle.partitions=3000
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.driver.maxResultSize=4G
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
try:3
--driver-memory 200G --deploy-mode client --executor-memory 100G --executor-cores 4
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
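For completeness, each try above is just the flag set handed to spark-submit; try:2, for example, is launched like this (the script name is a placeholder):

spark-submit \
  --driver-memory 200G --deploy-mode client \
  --executor-memory 40G --executor-cores 4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.executor.memoryOverhead=30g \
  --conf spark.rpc.message.maxSize=1024 \
  --conf spark.sql.shuffle.partitions=3000 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.driver.maxResultSize=4G \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  compute_deltas.py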