My PySpark job fails when I try to persist a DataFrame created from a table of size ~270 GB, with the error:
Exception in thread "yarn-scheduler-ask-am-thread-pool-9" java.lang.OutOfMemoryError: GC overhead limit exceeded
This issue happens only when I try to persist. Below are the configurations; I have tried playing around with executor/driver memory, shuffle partitions, dynamic allocation of executors, and the persist storage level (DISK_ONLY, MEMORY_AND_DISK). My intention is to partition the data on a key and persist it, so that my subsequent joins are faster. Any suggestion would be of great help.
Spark version: 1.6.1 (MapR Distribution)
Data size: ~270 GB
Configuration:
spark.executor.instances - 300
spark.executor.memory - 10g
spark.executor.cores - 3
spark.driver.memory - 10g
spark.yarn.executor.memoryOverhead - 2048
spark.io.compression.codec - lz4
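For reference, here is a minimal sketch of how I apply these settings when building the context programmatically (the app name is a placeholder, and sqlctx stands in for whatever SQLContext/HiveContext the job already uses); the same values can equally be passed as spark-submit flags:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("persist-270gb-test")                    # placeholder app name
        .set("spark.executor.instances", "300")
        .set("spark.executor.memory", "10g")
        .set("spark.executor.cores", "3")
        .set("spark.driver.memory", "10g")                   # only effective if set before the driver JVM starts
        .set("spark.yarn.executor.memoryOverhead", "2048")
        .set("spark.io.compression.codec", "lz4"))

sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)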
Normal query
query = "select * from tableA"
df = sqlctx.sql(query)
df.count()
This runs successfully with no persist().
Repartitioning & Persist
Keeping shuffle block limits in mind, I picked 2001 partitions so that each partition holds roughly 128 MB of data.
from pyspark import StorageLevel

test = df.repartition(2001, "key")       # repartition by the join key
test.persist(StorageLevel.DISK_ONLY)     # also tried MEMORY_AND_DISK
test.count()                             # fails with the GC overhead error above
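For context, the subsequent joins I want to speed up would look roughly like the sketch below (tableB and the column name "key" are placeholders); the idea is to pay the shuffle and persist cost once and reuse test across several joins:

other = sqlctx.sql("select * from tableB")        # hypothetical second table
joined = test.join(other, on="key", how="inner")  # reuse the persisted, key-partitioned DataFrame
joined.count()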