We have a requirement to process messages with Spark Streaming, pulled from Kafka. The Kafka topic we consume from carries around 100 different message types, but we are interested in only about 15 of them.
Currently we have to pull all the messages and then apply a filter on the RDD or DataFrame, roughly as sketched below.
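For reference, here is a minimal sketch of the current approach using the spark-streaming-kafka-0-10 direct stream. The broker address, topic name, type names, and the `extractType` helper are placeholders; we assume the message type can be parsed out of the record value:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils

object TypeFilterExample {
  // Hypothetical helper: assumes the message type is the first comma-separated field.
  def extractType(value: String): String = value.split(",", 2)(0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-type-filter")
    val ssc = new StreamingContext(conf, Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "vendor-broker:9092", // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "our-consumer-group",
      "auto.offset.reset" -> "latest"
    )

    // The ~15 types we actually care about (placeholder names).
    val wantedTypes = Set("typeA", "typeB", "typeC")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("vendor-topic"), kafkaParams) // placeholder topic
    )

    // Every message (~100K/min) is pulled into Spark before this filter runs;
    // roughly 85% of them are dropped right here.
    val filtered = stream.filter(rec => wantedTypes.contains(extractType(rec.value())))

    filtered.foreachRDD { rdd =>
      // downstream processing of the ~15K/min messages we keep
      rdd.foreach(rec => println(rec.value()))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```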
Since a lot of messages are discarded at this initial stage, is there a way to stop Kafka from sending those messages to Spark Streaming in the first place? If that were possible, we could run Spark Streaming on fewer nodes.
We get around 100K messages per minute, of which we process only about 15K.
Having a separate topic will not work for us because Kafka and the producer are managed by a third-party vendor.