I'm relatively new to PySpark. I'm trying to cache a 30 GB dataset because I need to perform clustering on it. Initially, performing any action such as a count gave me a heap space error. I googled around and found that increasing the executor/driver memory should fix it, so here is my current configuration:
from pyspark import SparkConf

conf = (SparkConf()
        .set('spark.executor.memory', '45G')
        .set('spark.driver.memory', '80G')
        .set('spark.driver.maxResultSize', '10G'))
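For context, this is roughly how I use that configuration and where the error shows up; it continues from the conf above, and the file path is just a placeholder for my real data:

from pyspark.sql import SparkSession

# build the session from the conf defined above
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# load the ~30 GB dataset (placeholder path)
df = spark.read.parquet('/path/to/my_data.parquet')

# cache it for the clustering step, then trigger materialization with an action
df.cache()
df.count()   # this is where the error is thrown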
But now I'm getting a garbage collection error. I checked SO, but the answers everywhere are quite vague; people just suggest playing with the configuration. Is there a better way to figure out what the configuration should be? I know this is just a debug exception and I can turn it off, but I'd still like to learn a bit of the maths for calculating the configuration on my own.
I'm currently on a server with 256GB RAM. Any help is appreciated. Thanks in advance.