I am getting the following error while doing a simple count() on a Spark DataFrame.
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 4778 tasks (1024.3 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
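For context, the job is essentially just the following (the table name is a placeholder; the real table is described under "More Info" below):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()            // the table is a Hive table (see below)
  .getOrCreate()

// Placeholder table name; the real table is partitioned by date and hr.
val df = spark.table("my_db.my_table")
  .where("date = '2020-08-13'")   // all 24 hr partitions of that date

println(df.count())               // <-- this is the step that fails
```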
Settings:
master : yarn-client
spark.driver.memory : 4g
spark.driver.maxResultSize : 1g
spark.executor.cores : 2
spark.executor.memory : 7g
spark.executor.memoryOverhead : 1g
spark.yarn.am.cores : 1
spark.yarn.am.memory : 3g
spark.yarn.am.memoryOverhead : 1g
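For what it's worth, these settings are passed via spark-submit; a rough in-code equivalent would look like the sketch below (spark.driver.memory is left out on purpose, since in yarn-client mode the driver JVM is already running and that value can only come from spark-submit or spark-defaults.conf):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the same configuration expressed in code.
val spark = SparkSession.builder()
  .master("yarn")                                   // client deploy mode
  .config("spark.driver.maxResultSize", "1g")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "7g")
  .config("spark.executor.memoryOverhead", "1g")
  .config("spark.yarn.am.cores", "1")
  .config("spark.yarn.am.memory", "3g")
  .config("spark.yarn.am.memoryOverhead", "1g")
  .enableHiveSupport()
  .getOrCreate()
```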
I am not sure why this simple piece of code is sending more than 1 GB of data to the driver. Ideally, the entire calculation should happen on the executors, and only the final count should come back to the driver.
The job would probably succeed if I increased spark.driver.maxResultSize, but I want to understand why the driver needs so much memory for a simple count operation in the first place.
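To be explicit, the quick fix I am referring to would be something like this, set before the SparkContext is created (e.g. via spark-submit --conf), with the value chosen here purely as an example:

```scala
import org.apache.spark.SparkConf

// Workaround I would rather understand than apply blindly: raise the cap
// above the ~1 GB of serialized task results reported in the error.
val conf = new SparkConf()
  .set("spark.driver.maxResultSize", "2g")  // e.g. up from the current 1g
```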
More Info:
Table : Hive table on top of S3 files
File type : parquet
Compression : snappy
Partition By : date, hr
Data Size : 91 GB [date = '2020-08-13' and hr = 00 to 23]
Number of Sub folders : 24 [hr partitions]
Number of Files : 1286 [date = '2020-08-13' and hr = 00 to 23]