4 votes

How can I fix the GC overhead limit exceeded error that occurs with PySpark 2.2.1 installed on Ubuntu 16.04.4?

Inside the Python 3.5.2 script I set up Spark as:

    spark = SparkSession.builder.appName('achats_fusion_files').getOrCreate()
    spark.conf.set("spark.sql.pivotMaxValues", "1000000")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    spark.conf.set("spark.executor.memory", "1g")
    spark.conf.set("spark.driver.memory", "1g")

How can I fix the problem by using the right settings inside the Python script?

Below is the error message:

18/03/14 09:57:25 ERROR Executor: Exception in task 34.0 in stage 36.0 (TID 2076)                                                                                                                         
java.lang.OutOfMemoryError: GC overhead limit exceeded                                                                                                                                                     
    at java.util.regex.Pattern.compile(Pattern.java:1667)                                                                                                                                              
    at java.util.regex.Pattern.<init>(Pattern.java:1351)                                                                                                                                               
    at java.util.regex.Pattern.compile(Pattern.java:1028)                                                                                                                                              
    at org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:266)                                                                                                                        
    at org.apache.spark.network.util.JavaUtils.byteStringAsBytes(JavaUtils.java:302)                                                                                                                   
    at org.apache.spark.util.Utils$.byteStringAsBytes(Utils.scala:1087)                                                                                                                                
    at org.apache.spark.SparkConf.getSizeAsBytes(SparkConf.scala:310)                                                                                                                                  
    at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:114)                                                                                                      
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:156)                                                                                                   
    at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:131)                                                                                                           
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:120)                                                                                                            
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)                                                                                                           
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)                                                                                         
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)                                                                                                                      
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)                                                                                                                      
    at org.apache.spark.scheduler.Task.run(Task.scala:108)                                                                                                                                             
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)                                                                                                                           
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)                                                                                                                 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)                                                                                                                 
    at java.lang.Thread.run(Thread.java:748)                               

1 Answer

3 votes

Taken straight from the docs:

  • The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options (see the sketch after this list for one way to pass these flags to the executors).
  • The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that the Young generation is sufficiently sized to store short-lived objects. This will help avoid full GCs to collect temporary objects created during task execution.
  • Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times before a task completes, it means that there isn’t enough memory available for executing tasks.
  • If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be E, then you can set the size of the Young generation using the option -Xmn=4/3*E. (The scaling up by 4/3 is to account for space used by survivor regions as well.)
  • In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Alternatively, consider decreasing the size of the Young generation. This means lowering -Xmn if you’ve set it as above. If not, try changing the value of the JVM’s NewRatio parameter. Many JVMs default this to 2, meaning that the Old generation occupies 2/3 of the heap. It should be large enough such that this fraction exceeds spark.memory.fraction.
  • Try the G1GC garbage collector with -XX:+UseG1GC. It can improve performance in some situations where garbage collection is a bottleneck. (This helped me)
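
As a starting point for the first bullet, here is a minimal sketch of enabling GC logging on the executors from a PySpark script. It assumes the flags are passed through spark.executor.extraJavaOptions and that the app name is only a placeholder; the GC output then appears in the executor logs, not on the driver.

    from pyspark.sql import SparkSession

    # GC-logging flags from the first bullet; they must reach the executor JVMs,
    # so they are set as spark.executor.extraJavaOptions before the session exists.
    gc_logging = "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

    spark = (SparkSession.builder
             .appName("gc_stats_check")  # placeholder app name
             .config("spark.executor.extraJavaOptions", gc_logging)
             .getOrCreate())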

Some more parameters that helped me were:

  • -XX:ConcGCThreads=20
  • -XX:InitiatingHeapOccupancyPercent=35

All GC tuning flags for executors can be specified by setting spark.executor.extraJavaOptions in a job’s configuration.
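
Since the question asks how to do this from inside the Python script: these options have to go on the builder, before the session (and its JVM) is created; calling spark.conf.set() on spark.executor.memory or spark.driver.memory after getOrCreate() has no effect. A minimal sketch, assuming the script is started directly with python (if it is launched through spark-submit, pass --driver-memory on the command line instead, because the driver JVM is already running by then); the 2g values are only placeholders:

    from pyspark.sql import SparkSession

    # Memory sizes and executor GC flags are read at JVM launch time,
    # so they must be configured on the builder, not via spark.conf.set().
    spark = (SparkSession.builder
             .appName('achats_fusion_files')
             .config("spark.driver.memory", "2g")      # placeholder size
             .config("spark.executor.memory", "2g")    # placeholder size
             .config("spark.executor.extraJavaOptions",
                     "-XX:+UseG1GC -XX:ConcGCThreads=20 "
                     "-XX:InitiatingHeapOccupancyPercent=35")
             .getOrCreate())

    # Runtime SQL settings can still be changed after the session exists.
    spark.conf.set("spark.sql.pivotMaxValues", "1000000")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")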

See the Garbage Collection Tuning section of the Spark tuning guide for further details.

EDIT:

In your spark-defaults.conf, write:

spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:ConcGCThreads=20 -XX:InitiatingHeapOccupancyPercent=35