
I am using Spark Structured Streaming to read from a Kafka topic.

Without adding any partitions to the topic, the Spark Structured Streaming consumer can read data.

But when I added partitions to the topic, the client shows messages from the last partition only. That is, if there are 4 partitions in the topic and I push numbers like 1, 2, 3, 4, the client prints only 4 and not the other values.

I am using the latest samples and binaries from the Spark Structured Streaming website.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Subscribe to topic1; the Kafka source should read from all of its partitions.
    Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        .load();
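
The writer side is just a console sink that prints the values, roughly like this (a simplified sketch; the exact query options may differ):

    import org.apache.spark.sql.streaming.StreamingQuery;

    // Simplified sketch of the writer: print the message values to the console.
    StreamingQuery query = df
        .selectExpr("CAST(value AS STRING)")
        .writeStream()
        .format("console")
        .outputMode("append")
        .start();
    query.awaitTermination(); // blocks; throws StreamingQueryException on failure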

Am I missing anything?

Pushing messages how? Including the key? How do you know those messages are even going to other partitions? Also, Spark will not automatically pick up new partitions until you restart the app - OneCricketeer
I am using the Kafka console producer to push data into the topic manually. - Ashish Nijai
Sure, but are you producing with keys (e.g. --property parse.key=true)? If not, how are you checking which partitions your messages are going into? - OneCricketeer
I am unable to check for any particular partition. I have 4 partitions. If I send 4 messages to the topic, the Spark consumer prints only the 4th message. The consumer prints only every 4th message, i.e. the 4th, 8th, and 12th messages. - Ashish Nijai
You can use Kafka's GetOffsetShell to list the latest offsets of each partition. That'll tell you whether messages are being sent to any/all partitions. Otherwise, if you have only one Spark executor, then it'll only consume from one Kafka partition, so you'll need to have more - OneCricketeer
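
On the Spark side, one way to check which partition each record comes from is to project the Kafka source's partition and offset columns, which are part of the source's fixed schema. A minimal sketch building on the df from the question:

    // Print each record's partition and offset next to its value, to verify
    // that records are arriving from all four partitions.
    df.selectExpr("partition", "offset", "CAST(value AS STRING) AS value")
      .writeStream()
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination();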

1 Answer


The issue was resolved by replacing kafka-clients-0.10.1.1.jar with kafka-clients-0.10.0.1.jar.

Found the reference here: Spark Structured Stream get messages from only one partition of Kafka
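
If the project is built with Maven, pinning the client to the working version looks something like this (these are the standard kafka-clients coordinates; adapt as needed to your build tool):

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.10.0.1</version>
    </dependency>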