
I am trying to read a Kafka topic with the new directStream method in KafkaUtils. I have a Kafka topic with 8 partitions. I am running the streaming job on YARN with 8 executors of 1 core each (--num-executors 8 --executor-cores 1). I noticed that Spark reads all of the topic's partitions in one executor, sequentially - this is obviously not what I want. I want Spark to read all partitions in parallel. How can I achieve that? A sketch of my setup follows below.
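For reference, this is roughly how I create the stream (broker address, topic name, and batch interval are simplified placeholders), against the spark-streaming-kafka 0.8 direct API:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-direct-example")
val ssc = new StreamingContext(conf, Seconds(10))

// Placeholder broker and topic; the real topic has 8 partitions.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("my-topic")

// Direct stream: each batch RDD has one partition per Kafka topic partition.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
```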

Thank you in advance.

Have you got any more insights on this? I use Spark standalone mode, so I cannot set the number of executors exactly, but I am interested in what will happen if I have 2 topics and the total number of cores set to 2? – Srdjan Nikitovic

1 Answer


At job creation, the driver makes an initial connection to Kafka solely to determine the offsets for the KafkaRDD - more specifically, the offset range for each partition of the KafkaRDD that will be distributed across the cluster.
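You can see those per-partition offset ranges on the driver by casting each batch's RDD to HasOffsetRanges; a sketch, assuming `stream` is the DStream returned by createDirectStream (0.8 direct API):

```scala
import org.apache.spark.streaming.kafka.HasOffsetRanges

// This closure runs on the driver for every batch.
stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // One OffsetRange per Kafka partition, i.e. one per RDD partition.
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} " +
      s"from=${r.fromOffset} until=${r.untilOffset}")
  }
}
```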

Those offsets are then used to fetch the data on each executor once the job actually executes. Depending on what you observed, you may have been seeing that initial communication (which comes from the driver). If all of your tasks are genuinely executing on the same executor, then something other than the Kafka integration is going wrong.
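One way to check where the partitions actually run is to log, per batch, the host that processes each RDD partition; a rough sketch (the hostname lookup is just for illustration, and `stream` is the direct stream from the question):

```scala
stream.foreachRDD { rdd =>
  // Each tuple records: partition index, executor host, record count.
  val placement = rdd.mapPartitionsWithIndex { (index, iter) =>
    Iterator((index, java.net.InetAddress.getLocalHost.getHostName, iter.size))
  }.collect()
  placement.foreach { case (part, host, count) =>
    println(s"partition=$part ran on host=$host with $count records")
  }
}
```

With 8 single-core executors on separate nodes you would expect to see several distinct hosts here (and 8 concurrent tasks in the Spark UI), not a single host handling all 8 partitions one after another.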