The Kafka 0.8 official documentation describes the Kafka consumer as follows:
"Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines. If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers."
I set up a Kafka cluster with Kafka 0.8.1.1 and use a Spark Streaming job (Spark 1.3) to pull data from its topics. The Spark Streaming code is as follows:
... ...
// Kafka connection parameters, including the consumer group id
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokerList);
kafkaParams.put("group.id", groupId);

// Create a direct (receiver-less) stream from the given topics
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet
);

// Count the messages received in each batch
messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd) throws Exception {
        long msgNum = rdd.count();
        System.out.println("There are " + msgNum + " messages read from Kafka.");
        ... ...
        return null;
    }
});
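For reference, one way to see exactly which partitions and offsets each job reads would be to log the offset ranges of every batch inside foreachRDD. Below is a minimal sketch using the HasOffsetRanges and OffsetRange classes from the spark-streaming-kafka module (the cast to HasOffsetRanges is the pattern shown in the Spark docs; I have not added this to the job above):

import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd) throws Exception {
        // Each OffsetRange describes the slice of one partition read in this batch,
        // so this shows whether a job is reading the full offset range on its own
        OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        for (OffsetRange range : ranges) {
            System.out.println("topic=" + range.topic()
                    + " partition=" + range.partition()
                    + " fromOffset=" + range.fromOffset()
                    + " untilOffset=" + range.untilOffset());
        }
        return null;
    }
});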
I then submitted two Spark Streaming jobs that consume the same topic with the same group id. I assumed that when I send 100 messages to the topic, the two jobs would get 100 messages in total (e.g. job1 gets 50 and job2 gets 50, or job1 gets 100 and job2 gets 0). However, each job gets all 100 messages. Such a result seems different from what the Kafka doc says.
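For comparison, the receiver-based API appears to go through Kafka's high-level consumer, which is the part of Kafka that implements the consumer-group balancing quoted above. A rough sketch of what I would try instead (zkQuorum, topicMap, and "myTopic" are assumed names here, not variables from my job above):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Map each topic to the number of consumer threads for it (assumed values)
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("myTopic", 1);  // hypothetical topic name

// Receiver-based stream: registers with ZooKeeper under groupId, so two jobs
// with the same group id should split the topic's partitions between them
JavaPairReceiverInputDStream<String, String> stream =
        KafkaUtils.createStream(jssc, zkQuorum, groupId, topicMap);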
Is there anything wrong with my code? Did I set the group id config correctly? Is this a bug or by design for createDirectStream()?
Test Env: Kafka 0.8.1.1 + Spark 1.3.1