
I'm trying to process 10 GB of data using Spark, and it is giving me this error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Laptop configuration: 4 CPUs, 8 logical cores, 8 GB RAM.

Spark configuration while submitting the Spark job:

spark = SparkSession.builder.master('local[6]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", 1)
spark.conf.set("spark.executor.cores", 5)

After searching the internet about this error, I have a few questions.

If they are answered, that would be a great help.

1) Spark is an in-memory computing engine. For processing 10 GB of data, should the system have 10+ GB of RAM? Does Spark load the 10 GB of data into 10+ GB of RAM and then do the job?

2) If point 1 is correct, how are big companies processing hundreds of TBs of data? Are they processing 100 TB of data by clustering multiple systems to form 100+ TB of RAM and then processing the 100 TB?

3) Is there no other way to process 50 GB of data with 8 GB RAM and 8 cores by setting proper Spark configurations? If there is, what is the way and what should the Spark configuration be?

4) What should the ideal Spark configuration be if the system has 8 GB RAM and 8 cores, for processing 8 GB of data?

Spark configuration to be defined in the Spark config:

spark = SparkSession.builder.master('local[?]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", ?)
spark.conf.set("spark.executor.cores", ?)

spark.executor.cores = ?
spark.executor.memory = ?
spark.yarn.executor.memoryOverhead = ?
spark.driver.memory = ?
spark.driver.cores = ?
spark.executor.instances = ?
No. of core instances = ?
spark.default.parallelism = ?

The problem is more likely to be with your code than with your configuration or your cluster resources. More often than not, you'll get this OOM error by unnecessarily collecting data onto a single node or the driver. Please post your code and indicate which action/operation is causing the error. - ernest_k
It depends on what your code does and how your data is split (is it one file of 10 GB, or is it split into small files?). - ShemTov

1 Answer


I hope the following helps, even if it doesn't clarify everything.

1) Spark is an in-memory computing engine. For processing 10 GB of data, should the system have 10+ GB of RAM? Does Spark load the 10 GB of data into 10+ GB of RAM and then do the job?

Spark, being an in-memory computation engine, takes its input/source from an underlying data lake or distributed storage system. The 10 GB file will be broken into smaller blocks (128 MB or 256 MB block size for a Hadoop-based data lake), and the Spark driver will get many executors/cores to read them from the cluster's worker nodes. If you try to load 10 GB of data on a laptop or a single node, it will certainly go out of memory. The data has to be loaded, either on one machine or across many workers/worker nodes, before it is processed.
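As a rough illustration of that splitting (the input path ./data/input.csv is a made-up placeholder), the following PySpark sketch reads a file and prints how many partitions Spark broke it into; the executor cores then work on a handful of these partitions at a time instead of holding the whole file in memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[6]').appName('partition-check').getOrCreate()

# Spark splits the input into partitions, roughly one per input block/split
df = spark.read.csv('./data/input.csv', header=True)

print('number of partitions:', df.rdd.getNumPartitions())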

2) If point 1 is correct, how are big companies processing hundreds of TBs of data? Are they processing 100 TB of data by clustering multiple systems to form 100+ TB of RAM and then processing the 100 TB?

Large data-processing projects design the storage and access layer with a lot of design patterns. They simply don't dump GBs or TBs of data into a file system like HDFS. They use partitions (for example, sales transaction data partitioned by month/week/day), and for structured data there are different file formats available (especially columnar ones) which help to load only those columns that are required for processing. So the right file format, partitioning, and compaction are the key attributes for large files.
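As a hedged sketch of that pattern (the column names sale_date, region, and amount and the file paths are hypothetical), the data can be written out partitioned by a date column in a columnar format, so later jobs read only the partitions and columns they actually need:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local[6]').appName('partitioned-write').getOrCreate()

sales = spark.read.csv('./data/sales.csv', header=True, inferSchema=True)

# Write partitioned by day so later jobs only touch the relevant directories
sales.write.mode('overwrite').partitionBy('sale_date').parquet('./data/sales_parquet')

# Columnar format + partition pruning: read only the needed day and columns
subset = (spark.read.parquet('./data/sales_parquet')
          .where(F.col('sale_date') == '2020-01-01')
          .select('region', 'amount'))
subset.show()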

3) Is there no other way to process 50 GB of data with 8 GB RAM and 8 cores by setting proper Spark configurations? If there is, what is the way and what should the Spark configuration be?

Very unlikely if there is no partitioning, but there are ways. It also depends on what kind of file it is. You could create a custom stream file reader that reads a logical block and processes it. However, enterprises don't read one 50 GB file as a single unit. Even if you load a 10 GB Excel file on your machine via an Office tool, it will go out of memory.
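As a very rough illustration of the stream-reader idea (plain Python, with a made-up file ./data/big.csv and an assumed numeric column named amount), the file can be processed row by row so only a small buffer is ever held in memory:

import csv

running_total = 0.0

with open('./data/big.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:  # rows are streamed from disk, never loaded all at once
        running_total += float(row['amount'])

print('total:', running_total)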

4) What should the ideal Spark configuration be if the system has 8 GB RAM and 8 cores, for processing 8 GB of data?

Leave 1 core and 1-2 GB for the OS and use the rest for your processing. Then, depending on what kind of transformation is being performed, you have to decide the memory for the driver and worker processes. Your driver should have 2 GB of RAM. But a laptop is primarily a playground to explore code syntax and is not suitable for large data sets. It is better to build your logic with dataframe.sample() and then push the code to a bigger machine to generate the output.
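One possible local-mode configuration along those lines (the exact numbers are assumptions, not a definitive recipe; in local mode everything runs in the single driver JVM, so driver memory is the setting that matters and the executor settings are effectively ignored):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[7]')                            # 7 of 8 cores, 1 left for the OS
         .config('spark.driver.memory', '6g')           # ~2 GB left for the OS; use --driver-memory instead if launching via the pyspark shell
         .config('spark.sql.shuffle.partitions', '56')  # a few partitions per core (assumed value)
         .appName('laptop-tuning')
         .getOrCreate())

# Develop against a small sample, then run the full job on a bigger machine
sample_df = spark.read.parquet('./data/sales_parquet').sample(fraction=0.01, seed=42)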