I have a use case where I have to process events in FIFO fashion. The events are generated by machines, and each machine generates one event every 30 seconds. For any particular machine, we need to process its events in FIFO order.
We need to process around 240 million events per day. For such a massive scale we need to use Kafka + Spark Streaming.
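For context, a quick back-of-the-envelope check of what those numbers imply (the machine count is derived from the stated rates, not a figure from our system):

```python
# Back-of-the-envelope throughput check for the rates stated above.
events_per_day = 240_000_000
seconds_per_day = 24 * 60 * 60

# Sustained average ingest rate.
events_per_second = events_per_day / seconds_per_day
print(round(events_per_second))  # ~2778 events/sec

# At one event per machine per 30 seconds, each machine emits
# 2880 events/day, so the fleet size implied by 240M/day is:
events_per_machine_per_day = seconds_per_day // 30  # 2880
machines = events_per_day / events_per_machine_per_day
print(round(machines))  # ~83333 machines
```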
From the Kafka documentation I understand that the key field of a message can be used to route it to a particular topic partition. This means I can use the machine id as the key and ensure that all messages from a particular machine land in the same topic partition.
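The idea, as I understand it, is just a deterministic hash of the key mod the partition count. A minimal illustration in plain Python (using an MD5 hash as a stand-in for Kafka's actual murmur2 partitioner, and an assumed partition count of 12):

```python
# Illustration of key-based routing: every message with the same key
# deterministically maps to the same partition. MD5 stands in here for
# Kafka's real murmur2-based default partitioner.
import hashlib

NUM_PARTITIONS = 12  # assumed topic partition count for illustration

def partition_for(machine_id: str) -> int:
    # Deterministic hash of the key bytes, mod the partition count.
    digest = hashlib.md5(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All messages keyed by machine "m-42" land in one partition:
partitions = {partition_for("m-42") for _ in range(100)}
print(len(partitions))  # 1 — a single partition for the whole machine
```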
That solves 50% of the problem.

Now comes the question on the processing side.
The Spark documentation for the Kafka direct approach says that RDD partitions are equivalent to Kafka partitions.
So when I execute rdd.foreachPartition, does the task iterate over the partition in an ordered fashion?

Is it guaranteed that an RDD partition always resides within one executor?

Is it guaranteed that the foreachPartition task is executed by only one thread for the entire partition?
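To make the requirement concrete, this is the behavior I am counting on, sketched as a plain-Python simulation. `records` stands in for the iterator that rdd.foreachPartition would hand to a task; the assertion encodes the FIFO property I need per machine:

```python
# Plain-Python sketch of the processing I want: one worker consumes a
# partition's records sequentially, in offset order. `records` stands in
# for the iterator Spark would pass to the foreachPartition task.
def process_partition(records):
    last_offset = {}  # machine_id -> last offset seen
    for offset, machine_id, payload in records:
        # FIFO requirement: offsets for a given machine must be strictly
        # increasing, which holds only if iteration is sequential and
        # the partition preserves Kafka's offset order.
        prev = last_offset.get(machine_id, -1)
        assert offset > prev, f"out-of-order event for {machine_id}"
        last_offset[machine_id] = offset
        # ... actual per-event processing would go here ...

# Simulated partition: interleaved events from two machines, offset-ordered.
partition = [
    (0, "m-1", "a"), (1, "m-2", "b"), (2, "m-1", "c"), (3, "m-2", "d"),
]
process_partition(iter(partition))
print("processed in FIFO order")
```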
Please help.