I have a use case where I have to process events in FIFO fashion. The events are generated by machines, and each machine generates one event every 30 seconds. For any particular machine, we need to process its events in FIFO order.
We need to process around 240 million events per day. For such a massive scale we need to use Kafka + Spark Streaming.
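For context, a quick back-of-the-envelope check of what those numbers imply (the machine count is derived from the stated rates, not a figure from our system):

```python
# Back-of-the-envelope throughput check for the rates stated above.
events_per_day = 240_000_000
seconds_per_day = 24 * 60 * 60

# Sustained average ingest rate.
events_per_second = events_per_day / seconds_per_day
print(round(events_per_second))  # ~2778 events/sec

# At one event per machine per 30 seconds, each machine emits
# 2880 events/day, so the fleet size implied by 240M/day is:
events_per_machine_per_day = seconds_per_day // 30  # 2880
machines = events_per_day / events_per_machine_per_day
print(round(machines))  # ~83333 machines
```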
From the Kafka documentation I understand that the key field of a message can be used to route it to a particular topic partition. This means I can use the machine id as the key and ensure that all messages from a particular machine land in the same topic partition.
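The idea, as I understand it, is just a deterministic hash of the key mod the partition count. A minimal illustration in plain Python (using an MD5 hash as a stand-in for Kafka's actual murmur2 partitioner, and an assumed partition count of 12):

```python
# Illustration of key-based routing: every message with the same key
# deterministically maps to the same partition. MD5 stands in here for
# Kafka's real murmur2-based default partitioner.
import hashlib

NUM_PARTITIONS = 12  # assumed topic partition count for illustration

def partition_for(machine_id: str) -> int:
    # Deterministic hash of the key bytes, mod the partition count.
    digest = hashlib.md5(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All messages keyed by machine "m-42" land in one partition:
partitions = {partition_for("m-42") for _ in range(100)}
print(len(partitions))  # 1 — a single partition for the whole machine
```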
That solves 50% of the problem.

Now comes the question on the processing side.
The Spark documentation for the Kafka direct approach says that RDD partitions are equivalent to Kafka partitions.
So when I execute rdd.foreachPartition, does the task iterate over the partition in an ordered fashion?

Is it guaranteed that an RDD partition always resides within one executor?

Is it guaranteed that the foreachPartition task is executed by only one thread for the entire partition?
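To make the requirement concrete, this is the behavior I am counting on, sketched as a plain-Python simulation. `records` stands in for the iterator that rdd.foreachPartition would hand to a task; the assertion encodes the FIFO property I need per machine:

```python
# Plain-Python sketch of the processing I want: one worker consumes a
# partition's records sequentially, in offset order. `records` stands in
# for the iterator Spark would pass to the foreachPartition task.
def process_partition(records):
    last_offset = {}  # machine_id -> last offset seen
    for offset, machine_id, payload in records:
        # FIFO requirement: offsets for a given machine must be strictly
        # increasing, which holds only if iteration is sequential and
        # the partition preserves Kafka's offset order.
        prev = last_offset.get(machine_id, -1)
        assert offset > prev, f"out-of-order event for {machine_id}"
        last_offset[machine_id] = offset
        # ... actual per-event processing would go here ...

# Simulated partition: interleaved events from two machines, offset-ordered.
partition = [
    (0, "m-1", "a"), (1, "m-2", "b"), (2, "m-1", "c"), (3, "m-2", "d"),
]
process_partition(iter(partition))
print("processed in FIFO order")
```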
Please help.