5 votes

With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as a source for Structured Streaming with pyspark.

In the Kafka topic, I have JSON messages (pushed with StreamSets Data Collector). However, I am not able to read them with the following code:

kafka = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:6667") \
    .option("subscribe", "mytopic") \
    .load()
msg = kafka.selectExpr("CAST(value AS STRING)")
disp = msg.writeStream.outputMode("append").format("console").start()

It generates this error:

 java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArrayDeserializer

I tried adding the following options to the readStream call:

.option("value.serializer","org.common.serialization.StringSerializer")
.option("key.serializer","org.common.serialization.StringSerializer")

But it does not solve the problem.

Any idea? Thank you in advance.

Try org.apache.kafka.common.serialization.StringDeserializer for both the key and value deserializers. - Akash Sethi
Take a look here; I hope this is what you are looking for: https://github.com/akashsethi24/Spark-Kafka-Stream-Example/blob/master/src/main/scala/KafkaConsumer.scala - Akash Sethi
I tried it, but it does not help. Maybe because I am using Structured Streaming. - JS G.

1 Answer

6 votes

Actually, I found the solution: I added the following jar as a dependency:

spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar

(after having downloaded it from https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10-assembly_2.10/2.1.0)
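As a sketch of how the jar can be supplied to the pyspark job (assuming the jar was downloaded to the current directory; alternatively, the same artifact coordinates from the Maven repository linked above can be passed via --packages so Spark resolves it automatically):

```
# Option 1: pass the locally downloaded assembly jar
spark-submit --jars spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar my_stream.py

# Option 2: let Spark fetch the artifact from Maven Central
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10-assembly_2.10:2.1.0 my_stream.py
```

The jar adds the Kafka client classes (including org.apache.kafka.common.serialization.ByteArrayDeserializer) to the driver and executor classpaths, which is why the NoClassDefFoundError disappears.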