I have a simple streaming job which pulls data from a Kafka topic and pushes it to S3.
df2 = parsed_df \
.coalesce(1)\
.writeStream.format("parquet")\
.option("checkpointLocation", "<s3location>")\
.option("path","s3location")\
.partitionBy("dt")\
.outputMode("Append")\
.trigger(processingTime='150 seconds')\
.start()
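For context, parsed_df comes from a Kafka source along these lines (the broker address, topic name, schema, and the JSON parsing step are simplified placeholders rather than my exact code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

# Placeholder schema -- the real payload has more fields.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
])

parsed_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<broker:9092>")   # placeholder
    .option("subscribe", "<topic>")                        # placeholder
    .option("startingOffsets", "latest")
    .load()
    # Kafka value arrives as bytes; cast to string and parse the JSON payload.
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    # Derive the "dt" column used by partitionBy("dt") in the sink.
    .withColumn("dt", to_date(col("event_ts")))
)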
The trigger interval is 150 seconds. My Spark configuration for this job is below.
"driverMemory": "6G",
"driverCores": 1,
"executorCores": 1,
"executorMemory": "3G",
{
    "spark.dynamicAllocation.initialExecutors": "3",
    "spark.dynamicAllocation.maxExecutors": "12",
    "spark.driver.maxResultSize": "4g",
    "spark.sql.session.timeZone": "UTC",
    "spark.executor.memoryOverhead": "1g",
    "spark.driver.memoryOverhead": "2g",
    "spark.dynamicAllocation.enabled": "true",
    "spark.rpc.message.maxSize": "1024",
    "spark.streaming.receiver.maxRate": "4000",
    "spark.port.maxRetries": "100",
    "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4,org.apache.spark:spark-streaming-kafka-0-10-assembly_2.12:2.4.4"
}
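For reference, the same conf map expressed as a SparkSession builder would look roughly like this (the app name is a placeholder; these settings only take effect if they are applied before the SparkContext is created):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-to-s3")                                   # placeholder name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.initialExecutors", "3")
    .config("spark.dynamicAllocation.maxExecutors", "12")
    .config("spark.driver.maxResultSize", "4g")
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.driver.memoryOverhead", "2g")
    .config("spark.rpc.message.maxSize", "1024")
    .getOrCreate()
)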
The job runs fine, but when I check the Spark UI I see many dead executors, and their number keeps increasing. Each 150-second batch processes 3-5k events. My questions are:
- Is this a valid scenario?
- If it is not, what could be the reason? Is it because dynamic allocation (spark.dynamicAllocation.enabled) is set to true? (See the experiment sketched below.)
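The experiment I have in mind, but have not run yet, is to turn dynamic allocation off and pin the executor count, so that no executor is ever released as idle; the app name and the fixed executor count below are placeholders:

from pyspark.sql import SparkSession

# Sketch only: run the same job with dynamic allocation disabled and a fixed
# number of executors. If the Spark UI then shows no dead executors, the earlier
# behaviour was presumably dynamic allocation reclaiming idle executors between
# the 150-second triggers.
spark = (
    SparkSession.builder
    .appName("kafka-to-s3-fixed-executors")           # placeholder name
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "3")           # fixed count instead of 3..12
    .config("spark.executor.memory", "3g")
    .config("spark.executor.cores", "1")
    .getOrCreate()
)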