I am running a Spark Streaming application with the following stages and configuration on Amazon EMR.
Stages:
dstream.map(record => transformRecord(record)).map(result => result._1).flatMap(rd => rd)
  .foreachRDD { rdd => val df = rdd.toDF(); df.write.save() }
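For context, here is a minimal self-contained sketch of the job; the broker source, the body of transformRecord, the column names, and the output path are placeholders I am assuming for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    object StreamJob {
      // Hypothetical transform: parse one JSON record into rows plus some metadata
      def transformRecord(record: String): (Seq[(String, String)], Long) = ???

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("stream-job"), Seconds(10))
        val dstream: DStream[String] = ??? // stream of JSON records from the message broker

        dstream.map(record => transformRecord(record))
          .map(result => result._1)
          .flatMap(rd => rd)
          .foreachRDD { rdd =>
            // toDF needs a SparkSession and its implicits in scope
            val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
            import spark.implicits._
            val df = rdd.toDF("id", "payload")              // assumed column names
            df.write.mode("append").save("s3://bucket/out") // placeholder sink
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }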
Configuration: 1 master node and 2 core nodes, running in YARN cluster mode. All other Spark properties are at their defaults: 2 executors, 4 executor cores, and 2g of executor memory.
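For reference, I believe the equivalent explicit spark-submit invocation would be roughly the following (the class and jar names are placeholders):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 2 \
      --executor-cores 4 \
      --executor-memory 2g \
      --class com.example.StreamJob \
      stream-job.jar

(On EMR, these values may also come from the cluster's spark-defaults.conf rather than explicit flags.)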
Use-case:
Consume a stream of JSON records from a message broker, transform them, and persist them to a database.
Question:
With this configuration in place, when spark-submit is executed I see that only one Spark executor is consuming and processing records. The other one just acts like a scheduler. Why does this happen?
How can I increase parallel processing, in the sense of consuming more records and processing them in isolation? (Will increasing the number of executors make any difference?)
What is the relationship between Spark executors and parallelism in Spark on YARN?