How can I fix a "GC overhead limit exceeded" error?
It happens with PySpark 2.2.1 installed on Ubuntu 16.04.4.
Inside my Python 3.5.2 script I set up Spark as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('achats_fusion_files').getOrCreate()
spark.conf.set("spark.sql.pivotMaxValues", "1000000")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.executor.memory", "1g")
spark.conf.set("spark.driver.memory", "1g")
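One variant I considered is passing the memory settings to the builder instead of the existing session, since `spark.driver.memory` and `spark.executor.memory` are read when the JVM starts and may be ignored when set on an already-created session. A sketch (the `4g` values are illustrative, not what I currently use):

```python
from pyspark.sql import SparkSession

# Sketch: pass memory settings at builder time so they take effect
# before the JVM launches; runtime conf.set() is too late for them.
spark = (SparkSession.builder
         .appName('achats_fusion_files')
         .config("spark.driver.memory", "4g")        # illustrative value
         .config("spark.executor.memory", "4g")      # illustrative value
         .config("spark.sql.pivotMaxValues", "1000000")
         .config("spark.sql.autoBroadcastJoinThreshold", "-1")
         .getOrCreate())
```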
How can I fix the problem by using the correct settings inside the Python script?
Below is the error message:
18/03/14 09:57:25 ERROR Executor: Exception in task 34.0 in stage 36.0 (TID 2076)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.compile(Pattern.java:1667)
at java.util.regex.Pattern.<init>(Pattern.java:1351)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:266)
at org.apache.spark.network.util.JavaUtils.byteStringAsBytes(JavaUtils.java:302)
at org.apache.spark.util.Utils$.byteStringAsBytes(Utils.scala:1087)
at org.apache.spark.SparkConf.getSizeAsBytes(SparkConf.scala:310)
at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:114)
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:156)
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:131)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:120)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)