0 votes

So I need to specify how an executor should consume data from a Kafka topic.

Let's say I have two topics, t0 and t1, with two partitions each, and two executors, e0 and e1 (both can be on the same node, so the assign strategy alone does not work: with multiple executors on one node, partitions are handed out by round-robin scheduling, and whichever executor is available first consumes the topic partition).

What I would like to do is make e0 consume partition 0 of both t0 and t1, while e1 consumes partition 1 of both. Is there no way around this except messing with scheduling? If so, what's the best approach?

The reason for doing so is that the executors will write to a Cassandra database, and since we will be in a parallelized context, one executor might "collide" with another and data would be lost. By assigning a partition I want to force each executor to process its data sequentially.

Is there a specific reason you'd like to do this? It might help to explain your thinking with regard to partition assignments. – Chris Matta

Yes, indeed. I need to force a given executor to get the same partition of all topics for integrity's sake. The executors will write to a Cassandra database, and since we will be in a parallelized context, one executor might "collide" with another and data would be lost. By assigning a partition I want to force the executor to process the data sequentially. – Sami Ouassaidi

2 Answers

1 vote

Spark 2.x supports assigning specific topic partitions to a stream using the assign option; see the Structured Streaming + Kafka Integration Guide for details.

Example:

Dataset<Row> ds2 = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  // "assign" is mutually exclusive with "subscribe", so only one is set;
  // the JSON maps each topic to the partitions this stream should read
  .option("assign", "{\"t0\": [0], \"t1\": [0]}")
  .load();
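
For the two-executor layout in the question, each partition set would get its own stream; a second reader pinned to partition 1 of both topics looks the same (a sketch; ds2b is just an illustrative name):

Dataset<Row> ds2b = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("assign", "{\"t0\": [1], \"t1\": [1]}")
  .load();

Note that assign only controls which partitions a given query reads, not which executor its tasks are scheduled on.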
0 votes

Here's the answer I got from a KafkaRDD and DirectKafkaInputDStream contributor, for those interested:

"Spark just really isn't a good fit for trying to pin particular computation to a particular executor, especially if you're relying on that for correctness."

EDIT: It worked quite well with a coalesce, and I was able to solve my problem. Although this does not directly control the executor, a good workaround was to assign the desired partitions to a specific stream through the assign strategy, coalesce that stream to a single partition, repeat the same process for the remaining topic partitions on different streams, and finally take a union of these streams.

No shuffle happens during the whole process, since the RDD partitions are collapsed to a single one with coalesce, which (unlike repartition) avoids a shuffle when reducing the partition count.
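
A minimal sketch of this workaround with the spark-streaming-kafka-0-10 DStream API, assuming the two-topics/two-partitions setup from the question; the app name, group id, batch interval, and the Cassandra write inside foreachRDD are placeholders:

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class PinnedPartitionStreams {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("pinned-partition-streams");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "host1:port1");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "pinned-example"); // placeholder group id

    // Stream A: partition 0 of both topics, pinned via the Assign strategy.
    Collection<TopicPartition> partsA =
        Arrays.asList(new TopicPartition("t0", 0), new TopicPartition("t1", 0));
    JavaInputDStream<ConsumerRecord<String, String>> streamA =
        KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Assign(partsA, kafkaParams));

    // Stream B: partition 1 of both topics.
    Collection<TopicPartition> partsB =
        Arrays.asList(new TopicPartition("t0", 1), new TopicPartition("t1", 1));
    JavaInputDStream<ConsumerRecord<String, String>> streamB =
        KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Assign(partsB, kafkaParams));

    // coalesce(1) collapses each batch RDD to a single partition without a
    // shuffle, so each stream's records are processed sequentially by one task.
    JavaDStream<ConsumerRecord<String, String>> merged =
        streamA.transform(rdd -> rdd.coalesce(1))
               .union(streamB.transform(rdd -> rdd.coalesce(1)));

    merged.foreachRDD(rdd -> {
      // Write to Cassandra here; within each coalesced stream the records
      // arrive in order, avoiding concurrent writes to the same rows.
    });

    jssc.start();
    jssc.awaitTermination();
  }
}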