
We have a requirement to process messages with Spark Streaming pulled from Kafka. The Kafka topic we pull from carries around 100 different message types, but we are interested in only about 15 of them.

Currently we need to pull all the messages and apply a filter on the RDD or DataFrame.
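For context, the current approach looks roughly like the sketch below. It is illustrative only: the broker address, group id, topic name, type names, and the value check are placeholders, not our real job.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object FilterAfterPull {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("filter-after-pull"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "vendor-broker:9092",   // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "our-spark-consumer"    // placeholder
    )

    // The ~15 types we care about; names are illustrative.
    val wantedTypes = Set("TYPE_A", "TYPE_B", "TYPE_C")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("vendor-topic"), kafkaParams))

    // Every one of the ~100K messages/min is pulled and its value
    // inspected before ~85% of them are thrown away here.
    val relevant = stream.filter(record =>
      wantedTypes.exists(t => record.value.contains(t)))

    relevant.map(_.value).print() // stand-in for the real processing
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The filter runs inside Spark, so all messages still cross the network and are deserialized before being dropped, which is exactly the waste described above.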

Since a lot of messages are discarded at this initial stage, is there a way to stop Kafka from sending those messages to Spark Streaming in the first place? If that were possible, we could run Spark Streaming on fewer nodes.

We get around 100K messages per minute, of which we process only about 15K.

Having a separate topic will not work for us because Kafka and the producer are managed by a third-party vendor.


1 Answer


Given these special requirements, I see one possible solution to your problem:

Ask the third-party vendor if it is possible to set the messageType as the message key. That might let you filter on the key alone, up front in your Spark app, without even parsing the value field of the Kafka message.
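If the vendor agrees, the filter sketched in the question collapses to a cheap key comparison. Reusing the placeholder `stream` and `wantedTypes` from that sketch:

```scala
// Filter on the Kafka message key alone; the value of a discarded
// record never needs to be parsed. Assumes the vendor writes the
// messageType as the record key.
val relevant = stream.filter(record => wantedTypes.contains(record.key))
```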

Further, this approach could also let you reduce the number of partitions you read from, since records with the same key always go to the same partition. This works under the following premises (see the sketch after the list):

  1. There is no custom partitioner in place
  2. The topic has more than one partition
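If both premises hold, you could assign only the partitions that the wanted keys land in, instead of subscribing to the whole topic. A sketch using Spark's `Assign` consumer strategy, with the same placeholder names as above; the partition numbers are hypothetical and would have to be determined first (with Kafka's default partitioner, a key's partition is the murmur2 hash of the key modulo the partition count, so it is fixed per key):

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

// Hypothetical: suppose the ~15 wanted keys all hash to partitions 0, 3 and 7.
val wantedPartitions = Seq(0, 3, 7).map(p => new TopicPartition("vendor-topic", p))

// Assign reads exactly these partitions and nothing else.
val narrowStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Assign[String, String](wantedPartitions, kafkaParams))
```

Note the trade-off: unwanted keys will also hash into those partitions, so you still filter by key afterwards; you only skip partitions that can never contain a wanted key.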