I am getting the following error while doing a simple count() on a Spark DataFrame.
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 4778 tasks (1024.3 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
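For context, the job is essentially just the following (the table name is a placeholder; the real table is described under "More Info" below):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()            // the table is a Hive table (see below)
  .getOrCreate()

// Placeholder table name; the real table is partitioned by date and hr.
val df = spark.table("my_db.my_table")
  .where("date = '2020-08-13'")   // all 24 hr partitions of that date

println(df.count())               // <-- this is the step that fails
```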
Settings:
master : yarn-client
spark.driver.memory : 4g
spark.driver.maxResultSize : 1g
spark.executor.cores : 2
spark.executor.memory : 7g
spark.executor.memoryOverhead : 1g
spark.yarn.am.cores : 1
spark.yarn.am.memory : 3g
spark.yarn.am.memoryOverhead : 1g
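For what it's worth, these settings are passed via spark-submit; a rough in-code equivalent would look like the sketch below (spark.driver.memory is left out on purpose, since in yarn-client mode the driver JVM is already running and that value can only come from spark-submit or spark-defaults.conf):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the same configuration expressed in code.
val spark = SparkSession.builder()
  .master("yarn")                                   // client deploy mode
  .config("spark.driver.maxResultSize", "1g")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "7g")
  .config("spark.executor.memoryOverhead", "1g")
  .config("spark.yarn.am.cores", "1")
  .config("spark.yarn.am.memory", "3g")
  .config("spark.yarn.am.memoryOverhead", "1g")
  .enableHiveSupport()
  .getOrCreate()
```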
I am not sure why this simple piece of code is sending more than 1 GB of data to the driver. Ideally, the entire calculation should happen on the executors, and only the final count should come back to the driver.
The job would probably succeed if I increased spark.driver.maxResultSize, but I want to understand why the driver needs so much memory for a simple count operation in the first place.
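To be explicit, the quick fix I am referring to would be something like this, set before the SparkContext is created (e.g. via spark-submit --conf), with the value chosen here purely as an example:

```scala
import org.apache.spark.SparkConf

// Workaround I would rather understand than apply blindly: raise the cap
// above the ~1 GB of serialized task results reported in the error.
val conf = new SparkConf()
  .set("spark.driver.maxResultSize", "2g")  // e.g. up from the current 1g
```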
More Info:
Table : Hive table on top of S3 files
File type : parquet
Compression : snappy
Partition By : date, hr
Data Size : 91 GB [date = '2020-08-13' and hr = 00 to 23]
Number of Sub folders : 24 [hr partitions]
Number of Files : 1286 [date = '2020-08-13' and hr = 00 to 23]