I am running a Spark Streaming application with the following stages and configuration on Amazon EMR.
Stages:
dstream.map(record => transformRecord(record)).map(result => result._1).flatMap(rd => rd)
  .foreachRDD { rdd => val df = rdd.toDF(); df.write.save() }
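For context, here is a minimal self-contained sketch of the job; the broker source, the body of transformRecord, the column names, and the output path are placeholders I am assuming for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    object StreamJob {
      // Hypothetical transform: parse one JSON record into rows plus some metadata
      def transformRecord(record: String): (Seq[(String, String)], Long) = ???

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("stream-job"), Seconds(10))
        val dstream: DStream[String] = ??? // stream of JSON records from the message broker

        dstream.map(record => transformRecord(record))
          .map(result => result._1)
          .flatMap(rd => rd)
          .foreachRDD { rdd =>
            // toDF needs a SparkSession and its implicits in scope
            val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
            import spark.implicits._
            val df = rdd.toDF("id", "payload")              // assumed column names
            df.write.mode("append").save("s3://bucket/out") // placeholder sink
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }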
Configuration: 1 master node and 2 core nodes, running in YARN cluster mode. All other Spark properties are at their defaults: 2 executors, 4 executor cores, and 2g of executor memory.
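For reference, I believe the equivalent explicit spark-submit invocation would be roughly the following (the class and jar names are placeholders):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 2 \
      --executor-cores 4 \
      --executor-memory 2g \
      --class com.example.StreamJob \
      stream-job.jar

(On EMR, these values may also come from the cluster's spark-defaults.conf rather than explicit flags.)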
Use-case:
Consume a stream of JSON records from a message broker, transform them, and persist them to a database.
Question:
With this configuration in place, when spark-submit is executed I see that only one Spark executor is consuming and processing records. The other one just acts like a scheduler. Why does this happen?
How can I increase parallel processing, in the sense of consuming more records and processing them in isolation? (Will increasing the number of executors make any difference?)
What is the relationship between Spark executors and parallelism in Spark on YARN?