4 votes

You know how in Apache Storm you can have a Spout streaming data to multiple Bolts. Is there a way to do something similar in Apache Spark?

Basically, I want one program to read data from a Kafka queue and output it to 2 different programs, each of which can then process it in its own way.

Specifically, there would be a reader program that reads data from the Kafka queue and feeds it to 2 programs, x and y. x would process the data to calculate metrics of one kind (in my case, user activities), whereas y would calculate metrics of another kind (in my case, activity broken down by device).
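Roughly, this is the shape I'm after (just a sketch, assuming Spark Streaming's receiver-based Kafka integration; the ZooKeeper host, group id and topic name are placeholders):

import java.util.Collections;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaFanOut {
    public static void main(String[] args) throws Exception {
        JavaStreamingContext jssc = new JavaStreamingContext(
                new SparkConf().setAppName("kafka-fan-out"), Durations.seconds(10));

        // One reader for the Kafka topic...
        JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
                jssc, "zkhost:2181", "reader-group", Collections.singletonMap("activity", 1));

        // ...feeding two independent processing chains ("x" and "y").
        messages.map(kv -> kv._2()).print();   // x: user-activity metrics would go here
        messages.map(kv -> kv._2()).print();   // y: per-device metrics would go here

        jssc.start();
        jssc.awaitTermination();
    }
}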

Can someone help me understand how this is possible in Spark?

Interesting question! Would you care to elaborate on your "2 different programs" output? - eliasah
So in general these could be any programs that need to read data from the same source. In my case, I want one program to be processing that data and calculating metrics whereas the other one should take that data and create a backup in s3. - Pravesh Jain
Can you edit your question with the additional information please? - eliasah
As for backing up to s3, that should be a regular save already available in Spark (see the sketch after these comments). And as for the other, you might want to consider a connector to HBase or Elasticsearch. - eliasah
One more thing: if you don't try to be more specific about what you need, your question will be flagged as too broad, and thus won't be resolved and may even be deleted, even though it's interesting. - eliasah
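(For what it's worth, the "regular save" mentioned above is just saveAsTextFile against an S3 path; a minimal sketch, where the bucket, path and input data are placeholders and S3 credentials are assumed to be configured:)

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3BackupSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("s3-backup"));
        // Stand-in for the data read from Kafka; any RDD can be saved this way.
        sc.parallelize(Arrays.asList("event1", "event2"))
          .saveAsTextFile("s3n://my-bucket/kafka-backup/");
        sc.stop();
    }
}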

1 Answer

1 vote

Why don't you simply create two topologies?

  1. Both topologies have a spout reading from the Kafka topic (yes, you can have multiple topologies reading from the same topic; I have this running on production systems). Make sure that you use different spout configs, otherwise Kafka/ZooKeeper will see both topologies as the same consumer (see the sketch after this list). Have a look at the documentation here.

SpoutConfig is an extension of KafkaConfig that supports additional fields with ZooKeeper connection info and for controlling behavior specific to KafkaSpout. The zkRoot will be used as the root to store your consumer's offset. The id should uniquely identify your spout.

public SpoutConfig(BrokerHosts hosts, String topic, String zkRoot, String id);
  2. Implement program x in topology x and program y in topology y.
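A minimal sketch of the two spout configs (assuming the storm-kafka KafkaSpout; the ZooKeeper host, topic name, zkRoots and ids are placeholders):

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class SpoutConfigs {
    static KafkaSpout spoutForX() {
        BrokerHosts hosts = new ZkHosts("zkhost:2181");
        // zkRoot "/kafka-spout-x" and id "spout-x" keep x's offsets separate...
        return new KafkaSpout(new SpoutConfig(hosts, "activity", "/kafka-spout-x", "spout-x"));
    }

    static KafkaSpout spoutForY() {
        BrokerHosts hosts = new ZkHosts("zkhost:2181");
        // ...from y's, which live under "/kafka-spout-y" with id "spout-y".
        return new KafkaSpout(new SpoutConfig(hosts, "activity", "/kafka-spout-y", "spout-y"));
    }
}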

Another option would be to have two graphs of bolts subscribing to the same spout, but IMHO this is not optimal: failed tuples (which are likely to fail in only a single graph) would be replayed to both graphs even if they failed in only one of them, and therefore some Kafka messages would end up being processed twice. Using two separate topologies avoids this.
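For contrast, a sketch of that single-spout layout (boltX and boltY are hypothetical stand-ins for the x and y logic):

import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;

public class SingleSpoutLayout {
    // boltX and boltY are placeholder bolts for the two processing graphs.
    static TopologyBuilder build(KafkaSpout spout, IRichBolt boltX, IRichBolt boltY) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", spout);
        // Both graphs subscribe to the one spout; a tuple failed in either
        // graph is replayed to both, so some messages get processed twice.
        builder.setBolt("bolt-x", boltX).shuffleGrouping("kafka-spout");
        builder.setBolt("bolt-y", boltY).shuffleGrouping("kafka-spout");
        return builder;
    }
}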