My PySpark job fails when I try to persist a DataFrame created from a table of size ~270 GB, with the error:
Exception in thread "yarn-scheduler-ask-am-thread-pool-9" java.lang.OutOfMemoryError: GC overhead limit exceeded
This issue happens only when I try to persist. Below are the configurations; I have tried playing around with executor/driver memory, shuffle partitions, dynamic allocation of executors, and the persist storage level (DISK_ONLY, MEMORY_AND_DISK). My intention is to partition the data on a key and persist it, so that my subsequent joins are faster. Any suggestion would be of great help.
Spark version: 1.6.1 (MapR Distribution)
Data size: ~270 GB
Configuration:
spark.executor.instances - 300
spark.executor.memory - 10g
spark.executor.cores - 3
spark.driver.memory - 10g
spark.yarn.executor.memoryOverhead - 2048
spark.io.compression.codec - lz4
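For reference, here is a minimal sketch of how I apply these settings when building the context programmatically (the app name is a placeholder, and sqlctx stands in for whatever SQLContext/HiveContext the job already uses); the same values can equally be passed as spark-submit flags:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("persist-270gb-test")                    # placeholder app name
        .set("spark.executor.instances", "300")
        .set("spark.executor.memory", "10g")
        .set("spark.executor.cores", "3")
        .set("spark.driver.memory", "10g")                   # only effective if set before the driver JVM starts
        .set("spark.yarn.executor.memoryOverhead", "2048")
        .set("spark.io.compression.codec", "lz4"))

sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)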
Normal query
query = "select * from tableA"
df = sqlctx.sql(query)
df.count()
This runs successfully with no persist().
Repartitioning & Persist
Keeping shuffle block limits in mind, I picked 2001 partitions so that each partition holds roughly 128 MB of data.
from pyspark import StorageLevel

test = df.repartition(2001, "key")       # repartition by the join key
test.persist(StorageLevel.DISK_ONLY)     # also tried MEMORY_AND_DISK
test.count()                             # fails with the GC overhead error above
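For context, the subsequent joins I want to speed up would look roughly like the sketch below (tableB and the column name "key" are placeholders); the idea is to pay the shuffle and persist cost once and reuse test across several joins:

other = sqlctx.sql("select * from tableB")        # hypothetical second table
joined = test.join(other, on="key", how="inner")  # reuse the persisted, key-partitioned DataFrame
joined.count()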