I'm trying to process, 10GB of data using spark it is giving me this error,
java.lang.OutOfMemoryError: GC overhead limit exceeded
Laptop configuration is: 4CPU, 8 logical cores, 8GB RAM
Spark configuration while submitting the spark job.
spark = SparkSession.builder.master('local[6]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", 1)
spark.conf.set("spark.executor.cores", 5)
After searching internet about this error, I have few questions
If answered that would be a great help.
1) Spark is in memory computing engine, for processing 10 gb of data, the system should have 10+gb of RAM. Spark loads 10gb of data into 10+ gb RAM memory and then do the job?
2) If point 1 is correct, how big companies are processing 100s of TBs of data, are they processing 100TB of data by clustering multiple systems to form 100+TB RAM and then process 100TB of data?
3) Is their no other way to process 50gb of data with 8gb RAM and 8Cores, by setting proper spark configurations? If it is what is the way and what should be the spark configurations.
4) What should be ideal spark configuration if the system properites are 8gb RAM and 8 Cores? for processing 8gb of data
spark configuration to be defined in spark config.
spark = SparkSession.builder.master('local[?]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", ?)
spark.conf.set("spark.executor.cores", ?)
spark.executors.cores = ?
spark.executors.memory = ?
spark.yarn.executor.memoryOverhead =?
spark.driver.memory =?
spark.driver.cores =?
spark.executor.instances =?
No.of core instances =?
spark.default.parallelism =?