
I am trying to read from Kafka using Spark, but I am running into what I suspect is a library-related issue.

I am pushing events to a Kafka topic that I can read through the Kafka console consumer, but not through Spark. I am using the spark-sql-kafka library and the project is built with Maven. The Scala version is 2.11.12 and the Spark version is 2.4.3.

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.4.3</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
            <version>2.4.3</version>
            <scope>provided</scope>
        </dependency>

My Java code is below:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("kafka-tutorials")
            .master("local[*]")
            .getOrCreate();

    // Read the Kafka topic as a streaming Dataset
    Dataset<Row> rows = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "meetup-trending-topics")
            .option("startingOffsets", "latest")
            .load();

    // Print each micro-batch to the console
    rows.writeStream()
            .outputMode("append")
            .format("console")
            .start();

    spark.streams().awaitAnyTermination();
    spark.stop();

Below is the error message I am getting:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
        at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)

Solution: either 1) build an uber jar, or 2) pass --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 to spark-submit. I was previously putting the --packages option after the main class, where spark-submit treats it as an application argument rather than an option (see the sketch below).
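For reference, a minimal sketch of the corrected spark-submit invocation; the jar name and main class below are placeholders, not from the original project:

    # Options such as --packages must come BEFORE the application jar;
    # anything after the jar is passed to the application as an argument.
    spark-submit \
      --class com.example.KafkaTutorial \
      --master local[*] \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 \
      kafka-tutorials-1.0.jar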

You are using incompatible artifact versions (some compiled with Scala 2.11, others with 2.12). – user10938362

1 Answer


This:

    <scope>provided</scope>

means that you are responsible for providing the appropriate jar on the runtime classpath. I (and many others) prefer to avoid this scope and instead build an uber jar to deploy.
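If you go the uber-jar route, remove the provided scope from the spark-sql-kafka-0-10 dependency so it gets bundled. A minimal maven-shade-plugin sketch (the plugin version here is an assumption; any recent release should work). Note the ServicesResourceTransformer: it merges META-INF/services files so the Kafka DataSourceRegister entry is not dropped during shading, which would otherwise reproduce the same "Failed to find data source: kafka" error from the uber jar:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <!-- version is an assumption; use any recent release -->
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <!-- Merge META-INF/services files so the Kafka
                                     data source registration survives shading -->
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>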