0 votes

I am trying to push Kafka data through Storm into HDFS and Hive. I am working with Hortonworks. Therefore I have the following structure, as seen (slightly modified) in many tutorials (http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/):

    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout("kafka-spout", kafkaSpout);
    builder.setBolt("hdfs-bolt", hdfsBolt).globalGrouping("kafka-spout");
    builder.setBolt("parse-bolt", new ParseBolt()).globalGrouping("kafka-spout");
    builder.setBolt("hive-bolt", hiveBolt).globalGrouping("parse-bolt");

I send the kafka-spout data directly to the hdfs-bolt, which works fine when I use only the hdfs-bolt. But when I add the parse-bolt to parse the Kafka data and emit it to the hive-bolt, the whole system goes crazy: even if I send just a single message over Kafka, that message is duplicated by the kafka-spout endlessly and written to HDFS over and over.

If there is an error in the parse-bolt, shouldn't the hdfs-bolt still work normally? I'm new to the topic; can someone spot a simple beginner's mistake? I am grateful for any advice.


1 Answer

0 votes

Are you acking the messages at the end of both bolts' execute() methods?

When both bolts read from the same stream of your kafka-spout, each message is anchored to the same spout tuple but delivered with a unique messageId per consumer. So even though it is only the parse-bolt's tuple that fails (or times out because it is never acked), that failure is reported back to the spout, and the spout replays the tuple. The replayed tuple, with a different messageId but the same content, is then delivered again to every bolt subscribed to that stream, in your case both the parse-bolt and the hdfs-bolt. Remember that the replay happens at the spout, so everything subscribed to that stream will receive the redundant messages.
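
To make that concrete, a reliable parse bolt has to ack (or fail) every input tuple it receives. Here is a minimal sketch of what your ParseBolt could look like; it is only an illustration, parseFields and the field names are made up, and the surrounding topology is assumed to be the one from your question:

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class ParseBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                // hypothetical parsing of the raw Kafka message
                Values parsed = parseFields(input.getString(0));
                // anchor the outgoing tuple to the input so reliability is preserved downstream
                collector.emit(input, parsed);
                // ack so the spout does not replay this tuple
                collector.ack(input);
            } catch (Exception e) {
                // fail explicitly (or ack and drop) instead of letting the tuple time out and replay forever
                collector.fail(input);
            }
        }

        private Values parseFields(String raw) {
            // hypothetical: a two-column, comma-delimited message
            String[] parts = raw.split(",");
            return new Values(parts[0], parts[1]);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("col1", "col2"));
        }
    }

Note that extending BaseBasicBolt instead of BaseRichBolt would make Storm ack each input tuple automatically after execute returns, which avoids this class of bug. The stock HdfsBolt and HiveBolt already ack their tuples themselves, so the missing ack is most likely in your custom parse bolt.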