I'm currently working on huge logs with PySpark and I'm facing some memory issues on my cluster.
It gives me the following error:
HTTP ERROR 500
Problem accessing /jobs/. Reason:
Server Error
Caused by:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Here's my current configuration:
spark.driver.cores 3
spark.driver.memory 6g
spark.executor.cores 3
spark.executor.instances 20
spark.executor.memory 6g
spark.yarn.executor.memoryOverhead 2g
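For context, the equivalent of that configuration expressed with the SparkSession builder would look roughly like the sketch below (the app name is made up, and the driver lines are illustrative only, since in practice driver memory/cores have to be supplied via spark-submit or spark-defaults.conf before the driver JVM starts):

```python
from pyspark.sql import SparkSession

# Roughly the same settings as above, expressed through the builder.
# Driver settings here are illustrative: in a normal PySpark deployment they
# must come from spark-submit / spark-defaults.conf, because the driver JVM
# is already running by the time this code executes.
spark = (
    SparkSession.builder
    .appName("log-processing")  # hypothetical app name
    .config("spark.driver.cores", "3")
    .config("spark.driver.memory", "6g")
    .config("spark.executor.cores", "3")
    .config("spark.executor.instances", "20")
    .config("spark.executor.memory", "6g")
    .config("spark.yarn.executor.memoryOverhead", "2g")
    .getOrCreate()
)
```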
First, I don't cache/persist anything in my spark job.
I've read that it could be related to memoryOverhead, which is why I've increased it, but it doesn't seem to be enough. I've also read that it could be an issue with the garbage collector. And that's my main question here: what's the best practice when you have to deal with many different databases?
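For what it's worth, here is the kind of change I mean (a sketch only: the bumped overhead value and the G1 collector switch are examples I've seen suggested, not settings I've validated):

```python
from pyspark.sql import SparkSession

# Sketch only: a further memoryOverhead bump plus the often-suggested G1
# collector and GC logging on the executors. The values are examples.
spark = (
    SparkSession.builder
    .config("spark.yarn.executor.memoryOverhead", "3g")   # example: bumped from 2g
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")
    .getOrCreate()
)
```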
I have to do a lot of JOINs; I'm doing them with Spark SQL and I'm creating a lot of TempViews. Is that bad practice? Would it be better to write a few huge SQL requests and do something like 10 joins inside one SQL request? That would reduce code readability, but could it help solve my problem?
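To make the question concrete, here is a stripped-down sketch of the two styles I'm comparing (table names, column names, and paths are made up; my real job has many more joins):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = spark.read.parquet("/data/logs")          # hypothetical inputs
users = spark.read.parquet("/data/users")
sessions = spark.read.parquet("/data/sessions")

# Style 1: one temp view per table, then a single big SQL request.
logs.createOrReplaceTempView("logs")
users.createOrReplaceTempView("users")
sessions.createOrReplaceTempView("sessions")
joined_sql = spark.sql("""
    SELECT l.*, u.country, s.duration
    FROM logs l
    JOIN users u    ON l.user_id = u.user_id
    JOIN sessions s ON l.session_id = s.session_id
""")

# Style 2: the same joins expressed with the DataFrame API, no temp views.
joined_df = (
    logs.join(users, "user_id")
        .join(sessions, "session_id")
)
```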
Thanks,
Comments:

What is the -Xmx parameter value passed to Spark's driver JVM? Did you try to connect to the driver using VisualVM to check what objects are taking so much memory? – Mariusz

Make sure that spark.memory.fraction=0.6. If it is higher than that you run into garbage collection errors, see stackoverflow.com/a/47283211/179014 – asmaier
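Following up on the comments, a minimal sketch (assuming an already running PySpark session) of how the effective driver/executor settings and spark.memory.fraction can be inspected at runtime:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the memory-related settings the driver actually sees; which keys are
# present depends on what was set via spark-submit / spark-defaults.conf.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith(("spark.driver", "spark.executor", "spark.memory", "spark.yarn")):
        print(key, "=", value)

# spark.memory.fraction defaults to 0.6 when it is not set explicitly.
print(spark.conf.get("spark.memory.fraction", "0.6"))
```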