
I'm currently working on huge logs with PySpark and I'm facing some memory issues on my cluster.

It gives me the following error:

HTTP ERROR 500

Problem accessing /jobs/. Reason:

Server Error Caused by:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Here's my current configuration:

spark.driver.cores  3
spark.driver.memory 6g
spark.executor.cores    3
spark.executor.instances    20
spark.executor.memory   6g
spark.yarn.executor.memoryOverhead  2g
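
For context, here is roughly how these settings can be passed when building the SparkSession (the app name is just a placeholder, and the spark.driver.* values normally have to be set before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf, so they are shown only for completeness):

from pyspark.sql import SparkSession

# Sketch only: same values as the configuration above, app name is a placeholder.
spark = (
    SparkSession.builder
    .appName("log-processing")
    .config("spark.driver.cores", "3")    # driver settings usually belong in spark-submit / spark-defaults.conf
    .config("spark.driver.memory", "6g")
    .config("spark.executor.cores", "3")
    .config("spark.executor.instances", "20")
    .config("spark.executor.memory", "6g")
    .config("spark.yarn.executor.memoryOverhead", "2g")
    .getOrCreate()
)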

First, I don't cache/persist anything in my spark job.

I've read that it could be related to memoryOverhead, which is why I've increased it, but it doesn't seem to be enough. I've also read that it could be an issue with the garbage collector. And that's my main question here: what's the best practice when you have to deal with many different databases?

I have to do a lot of JOINs, which I do with Spark SQL, and I create a lot of TempViews. Is that a bad practice? Would it be better to write a few huge SQL queries and do, say, 10 joins inside one SQL request? It would reduce code readability, but could it help solve my problem?
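
To illustrate, the current pattern looks roughly like this (the DataFrame, view, and column names are made up for the example; each DataFrame comes from a different database):

# Illustrative sketch only: logs_df, users_df, sessions_df and their columns are hypothetical.
logs_df.createOrReplaceTempView("logs")
users_df.createOrReplaceTempView("users")
sessions_df.createOrReplaceTempView("sessions")

# One of many joins done through Spark SQL on the temp views.
joined = spark.sql("""
    SELECT l.*, u.country, s.duration
    FROM logs l
    JOIN users    u ON l.user_id    = u.user_id
    JOIN sessions s ON l.session_id = s.session_id
""")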

Thanks,

What Spark version do you use? Are you able to verify the -Xmx parameter value passed to Spark's driver JVM? Did you try to connect to the driver using VisualVM to check what objects are taking so much memory? – Mariusz
Make sure your spark.memory.fraction=0.6. If it is higher than that, you run into garbage collection errors; see stackoverflow.com/a/47283211/179014 – asmaier

1 Answer


Well, I think I've fixed my issue. It was related to broadcasting.

I think that because my joins are pretty big, they take quite some time, so I've disabled broadcast joins with:

config("spark.sql.autoBroadcastJoinThreshold", "-1")

The problem seems to be solved.

Thanks,