I am getting the following error while running a simple count operation on a Spark DataFrame.

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 4778 tasks (1024.3 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

Settings:

master : yarn-client
spark.driver.memory : 4g
spark.driver.maxResultSize : 1g
spark.executor.cores : 2
spark.executor.memory : 7g
spark.executor.memoryOverhead : 1g
spark.yarn.am.cores : 1
spark.yarn.am.memory : 3g
spark.yarn.am.memoryOverhead : 1g

I am not sure why this simple piece of code sends more than 1 GB of data to the driver. Ideally, the entire calculation should happen on the executors, and only the final result should come back to the driver.
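As a rough sanity check on the figures in the error message, the average serialized result per task can be worked out directly. This is only arithmetic on the reported numbers, assuming "MB" in the message means 1024 KB; it does not say what the per-task payload actually contains.

```python
# Figures copied from the error message in the question.
total_result_mb = 1024.3   # total size of serialized task results
max_result_mb = 1024.0     # spark.driver.maxResultSize (1g)
num_tasks = 4778

# Assumption: "MB" in the message means 1024 KB.
avg_result_kb = total_result_mb * 1024 / num_tasks
overshoot_mb = total_result_mb - max_result_mb

print(f"average result per task: {avg_result_kb:.1f} KB")
print(f"over the limit by: {overshoot_mb:.1f} MB")
```

That works out to roughly 220 KB per task, which is far more than a single partial count should need, so whatever each task returns, it is not just one number.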

The code might work if I increase spark.driver.maxResultSize, but I want to understand why the driver needs so much memory for a simple count operation.
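For reference, the workaround of raising the limit could be passed on the command line through spark-submit's --conf flag. The sketch below only assembles the argument list; the value 2g and the script name count_job.py are illustrative assumptions, not settings taken from this job, and build_spark_submit is a hypothetical helper.

```python
def build_spark_submit(app, conf):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    # "--master yarn --deploy-mode client" is the current spelling of yarn-client.
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "client"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Assumed values: 2g is an arbitrary example, count_job.py is a placeholder name.
cmd = build_spark_submit("count_job.py", {"spark.driver.maxResultSize": "2g"})
print(" ".join(cmd))
```

This only hides the symptom; it does not explain why the task results are this large in the first place.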

More Info:

Table : Hive table on top of s3 files
File type : parquet
Compression : snappy
Partition By : date, hr
Data Size : 91 GB [date = '2020-08-13' and hr = 00 to 23]
Number of Sub folders : 24 [hr partitions]
Number of Files : 1286 [date = '2020-08-13' and hr = 00 to 23]

Additional query with sub-partition condition.

1 Answer

I think it is sending the result to your driver because you used select * and then expect the driver to perform the count.

If you only need the count, try select count(*) instead.
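A minimal sketch of that suggestion in PySpark, assuming a table named events (the real table name is not given in the question) and the date partition from the post. Both count_sql and run_count are hypothetical helpers, and whether this avoids the error in your case is untested.

```python
def count_sql(table, date):
    """Build a query that asks Spark SQL for the count itself."""
    return f"SELECT COUNT(*) AS n FROM {table} WHERE date = '{date}'"


def run_count(spark, table, date):
    """Run the query on an existing SparkSession and return the count as an int."""
    # Only the single aggregated row is collected, not the table's rows.
    return spark.sql(count_sql(table, date)).collect()[0]["n"]


# "events" is a placeholder table name; the date comes from the question.
print(count_sql("events", "2020-08-13"))
```

With a live session this would be called as run_count(spark, "events", "2020-08-13"), where spark comes from SparkSession.builder.getOrCreate().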