6 votes

I'm relatively new to PySpark. I have been trying to cache a 30GB dataset because I need to perform clustering on it. Initially, performing any action like count gave me a heap space error. I googled and found that increasing the executor/driver memory should fix it for me. So here's my current configuration:

from pyspark import SparkConf

conf = SparkConf() \
    .set('spark.executor.memory', '45G') \
    .set('spark.driver.memory', '80G') \
    .set('spark.driver.maxResultSize', '10G')

But now I'm getting a garbage collection issue ("GC overhead limit exceeded"). I checked SO, but everywhere the answers are quite vague. People suggest playing with the configuration. Is there a better way to figure out what the configuration should be? I know this is just a debug exception and I can turn it off, but I'd still like to learn a bit of the maths for calculating the configuration on my own.

I'm currently on a server with 256GB RAM. Any help is appreciated. Thanks in advance.


2 Answers

3 votes

How many cores does your server/cluster have?

What this GC error is saying is that Spark has spent at least 98% of the run time garbage collecting (cleaning up unused objects from memory) but has managed to free less than 2% of the heap while doing so. I don't think it's avoidable, as you suggest, because it indicates that memory is almost full and garbage collection is needed; suppressing the message would likely just lead to an out-of-memory error shortly afterwards. This link will give you the details about what this error means, as will this resource.

Solving it can be as simple as messing around with config settings, as you have mentioned, but it can also mean you need code fixes. Reducing how many temporary objects are being stored, making your dataframe as compact as it can be (encoding strings as indices, for example), and performing joins and other operations at the right (most memory-efficient) time can all help. Look into broadcasting smaller dataframes for joins; a sketch follows below. It's tough to suggest anything more specific without seeing code.
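To illustrate the broadcast-join idea, here is a minimal sketch using made-up dataframes (the names `large_df`, `small_df`, and the `user_id`/`segment` columns are purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-ins: a large fact table and a small lookup table.
    large_df = spark.range(10_000_000).withColumnRenamed("id", "user_id")
    small_df = spark.createDataFrame(
        [(i, f"segment_{i % 5}") for i in range(100)],
        ["user_id", "segment"],
    )

    # broadcast() hints Spark to ship small_df to every executor,
    # so large_df is joined in place instead of being shuffled.
    joined = large_df.join(broadcast(small_df), on="user_id")

The point is that only the small table moves across the network, which avoids holding shuffle buffers for the big one.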

For your Spark config tuning, this link should provide all the info you need. Your config settings seem very high at first glance (45G executor plus 80G driver on a single 256GB node), but I don't know your cluster setup; a back-of-the-envelope sizing sketch follows below.
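To give the flavour of the maths, here is a rough sizing sketch. The core count is an assumption, since you haven't said how many your server has; the "5 cores per executor" figure is a commonly cited rule of thumb, not a hard rule:

    # Back-of-the-envelope executor sizing. Numbers marked ASSUMED
    # are illustrative; substitute your real hardware.
    total_ram_gb = 256       # your server's RAM
    total_cores = 32         # ASSUMED; check with `nproc`
    os_reserve_gb = 16       # headroom for the OS and other daemons
    cores_per_executor = 5   # common rule of thumb

    num_executors = total_cores // cores_per_executor                    # -> 6
    mem_per_executor = (total_ram_gb - os_reserve_gb) // num_executors   # -> 40 GB

    # Spark also reserves off-heap overhead (spark.executor.memoryOverhead,
    # roughly 7-10% of executor memory), so request a bit below this figure.
    print(num_executors, "executors at ~", mem_per_executor, "GB each")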

0 votes

I had the same error. I reduced the size of the Spark dataframe before converting it to Pandas, and I also added pyarrow to the Spark config settings.

I started with: conda install -c conda-forge pyarrow -y

Added this to the code: .config("spark.sql.execution.arrow.enabled","true")\
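In context, that option goes on the SparkSession builder. A minimal sketch (the app name is made up; note that on Spark 3.x this setting was renamed to `spark.sql.execution.arrow.pyspark.enabled`):

    from pyspark.sql import SparkSession

    # Enable Arrow-based columnar transfer, which speeds up toPandas().
    # On Spark 3.x, use "spark.sql.execution.arrow.pyspark.enabled" instead.
    spark = SparkSession.builder \
        .appName("arrow-topandas") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .getOrCreate()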

And broke up the calls as follows (optional, I think):

    df=spark.sql(f"""select * from {hdfs_db}.{hdfs_tbl}""")
    ## === Select a few columns
    df = df.select(['sbsb_id', 'rcvd_dt', 'inq_tracking_id', 'comments'])
    ## === Convert to Pandas
    df = df.toPandas()
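Since toPandas() materializes the whole dataframe in driver memory, capping the row count before converting helps too if the frame is still large after trimming columns. For example, replacing the last line above with something like this (1,000,000 is an arbitrary illustrative cap):

    # Cap the row count before collecting to the driver.
    df = df.limit(1_000_000).toPandas()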