0 votes

We are using Kafka and Spark Streaming to process trade data. We receive the data from Kafka in Avro format as [key, byte[]] records. We deserialize the data and send it on for further processing. We are using DStreams in our Spark Streaming application. We have a requirement to partition the data based on the key of the received Avro record, so that whenever we receive data from Kafka as a stream, each record is sent to a specific executor.

There are 10 different key types that we can receive from Kafka, so all records with key1 should go to Node1, all records with key2 should go to Node2, and so on.

We map the received stream data to an RDD, but not to a pairRDD.

Please let us know if we can configure partitioning based on the key of the record received from Kafka.
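
For reference, here is a minimal sketch of our current setup, assuming the spark-streaming-kafka-0-10 direct stream; the broker address, topic name, group id, and the deserializeTrade Avro decoder are placeholders:

    import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val conf = new SparkConf().setAppName("TradeProcessor")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092", // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[ByteArrayDeserializer],
      "group.id"           -> "trade-processor" // placeholder group id
    )

    // Each record arrives as [key, byte[]] with an Avro-encoded value.
    val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
      ssc, PreferConsistent, Subscribe[String, Array[Byte]](Seq("trades"), kafkaParams))

    // Placeholder for our Avro decoding step.
    def deserializeTrade(bytes: Array[Byte]): Array[Byte] = bytes

    // We map to a plain RDD of deserialized trades, not a pairRDD.
    val trades = stream.map(record => deserializeTrade(record.value()))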


2 Answers

3 votes

If I had this requirement, I would first keep a few concepts in mind:

  1. Kafka distributes messages based on their keys, i.e. all messages with the same key go into the same topic partition (see the sketch after the next paragraph).
  2. The Spark Kafka connector gives you one consumer per partition per group.id.
  3. Spark logic can't be written for a specific node, since there is no specific node allocation upfront.

What this basically means is that your data is already split according to your needs (by key) and is already going to specific nodes; you just don't have much control over the node allocation.
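
To make point 1 concrete, here is a minimal sketch of keyed producing (broker and topic names are placeholders): under Kafka's default partitioner, every record sent with the same key lands in the same partition, and therefore reaches the same consumer.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

    val producer = new KafkaProducer[String, Array[Byte]](props)

    // The default partitioner hashes the key, so all "key1" records
    // go to one partition of the topic.
    val payload: Array[Byte] = Array.emptyByteArray // placeholder Avro bytes
    producer.send(new ProducerRecord[String, Array[Byte]]("trades", "key1", payload))
    producer.close()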

Here is what I would have done.

Since Kafka is key-controlled, first create a direct Kafka DStream, which connects your Spark nodes to the Kafka partitions. What you then need to do is identify which key the consumer on each node is attached to.

Write the logic of the Spark job to dispatch to key-specific sub-logic, based on the key received by the node; it is easy to find from the very first record. Control should then be handed to the sub-logic that processes that particular key. You only have to do this check once, or spend a few nanoseconds on it each time.
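
A sketch of that dispatch, assuming a direct stream of [String, Array[Byte]] records as above; the processKeyN handlers are hypothetical stand-ins for the per-key sub-logic:

    import org.apache.kafka.clients.consumer.ConsumerRecord

    // Hypothetical per-key handlers; replace with the real sub-logic.
    def processKey1(rs: Iterator[ConsumerRecord[String, Array[Byte]]]): Unit = rs.foreach(_ => ())
    def processKey2(rs: Iterator[ConsumerRecord[String, Array[Byte]]]): Unit = rs.foreach(_ => ())
    def processOther(key: String, rs: Iterator[ConsumerRecord[String, Array[Byte]]]): Unit = rs.foreach(_ => ())

    // Peek at the first record of each partition to decide which
    // key-specific sub-logic should handle the whole partition.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        if (records.hasNext) {
          val buffered = records.buffered
          buffered.head.key match {   // peek without consuming the record
            case "key1" => processKey1(buffered)
            case "key2" => processKey2(buffered)
            case other  => processOther(other, buffered)
          }
        }
      }
    }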

If you still want more control, maybe think of simple micro-services instead of Spark Streaming.

0 votes

What Abhishek has said is correct. While sending the data, you will need to use a partitioner based on keys, and have a sufficient number of partitions so that the data belonging to each key ends up in its own partition. Use the direct stream approach in Spark Streaming, as this spawns exactly the number of consumers needed to serve the partitions of the Kafka topic. Each partition in Spark will then hold the data present in the corresponding Kafka partition. But we cannot specify that a particular partition should be processed by a specific node.
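
As a sketch of such a key-based producer partitioner (the class name is illustrative, and it assumes the ten keys key1..key10 are known up front, with at least ten partitions on the topic):

    import java.util.{Map => JMap}
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster

    // Pins each of the ten known keys to its own fixed partition;
    // unknown keys fall back to a hash of the key.
    class TradeKeyPartitioner extends Partitioner {
      private val keyToPartition = (1 to 10).map(i => s"key$i" -> (i - 1)).toMap

      override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                             value: Any, valueBytes: Array[Byte], cluster: Cluster): Int =
        keyToPartition.getOrElse(key.toString,
          (key.toString.hashCode & Int.MaxValue) % cluster.partitionCountForTopic(topic))

      override def close(): Unit = ()
      override def configure(configs: JMap[String, _]): Unit = ()
    }

    // Registered on the producer with:
    // props.put("partitioner.class", classOf[TradeKeyPartitioner].getName)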