I created a Dataproc cluster with 1 master and 10 worker nodes, all with the same CPU and memory configuration: 32 vCPUs and 120 GB of memory. When I submitted a job that processes a large amount of data and does heavy computation, the job failed.
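For reference, the cluster shape corresponds roughly to the following gcloud command; the cluster name and region are placeholders, and n1-standard-32 is my assumption based on the 32 vCPU / 120 GB figures:

    # 1 master + 10 workers, each 32 vCPUs / 120 GB (n1-standard-32)
    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --master-machine-type=n1-standard-32 \
        --worker-machine-type=n1-standard-32 \
        --num-workers=10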
From the logs I am not really sure what caused the failure, but I did see this memory-related error message from job job-c46fc848-6:

    Container killed by YARN for exceeding memory limits. 24.1 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
So I tried a few solutions I found in other posts. For example, I increased spark.executor.memoryOverhead and spark.driver.maxResultSize in the "Properties" section when submitting the job from the "Jobs" console. The job find-duplicate-job-c46fc848-7 still failed with the same properties applied (see the sketch below).
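For context, the submission was roughly equivalent to the command below; the main class, jar path, and property values are illustrative placeholders rather than the exact ones I used:

    # bump executor memory overhead (MiB) and the driver's max result size
    gcloud dataproc jobs submit spark \
        --cluster=my-cluster \
        --region=us-central1 \
        --class=com.example.FindDuplicates \
        --jars=gs://my-bucket/find-duplicates.jar \
        --properties=spark.executor.memoryOverhead=4096,spark.driver.maxResultSize=4g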
I was also seeing warning messages that I am not sure how to interpret:

    18/06/04 17:13:25 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: No more replicas available for rdd_43_155 !
I am going to try creating a larger cluster to see if that helps, but I doubt it will solve the issue, since a cluster with 1 master and 10 nodes at 32 vCPUs and 120 GB of memory each is already quite powerful.
I hope to get some help from more experienced users and experts. Thanks in advance!