1
votes

I use maven to manage my project. And I do add

  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>

to the maven dependencies

Below is my pom.xml

  <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>***</groupId>
  <artifactId>***</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>

    <plugins>         
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.11</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>2.11.8</scalaVersion>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.0.2</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>

    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <groupId>org.apache.maven.plugins</groupId>
        <configuration>
            <appendAssemblyId>false</appendAssemblyId>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
                <manifest>
                    <mainClass>***</mainClass>
                </manifest>
            </archive>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
    </plugins>

 </build>


  <dependencies>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>


    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

    <dependency>
      <groupId>org.scalaj</groupId>
      <artifactId>scalaj-http_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

  </dependencies>
</project>

I package everything using:

mvn clean package

I submit my job locally by typing:

spark-submit --class ... <path to jar file> <arguments to run the main class>

But I will get an error saying:Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html

I know I can fix this problem by adding --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 after the spark-submit.

But how can I modify my pom to advoid doing that? The thing is in my maven repo, I can see spark-sql-kafka-0-10_2.11-2.2.0.jar has been downloaded. Then why I need to add the dependency mannually during the spark submit? I feel like there might be some error in my pom.xml even though I use the assembly to build my jar.

Hope someone can help me out!

1
Have you tried peeking in the uber JAR to see if the sql-kafka jar was appended to it? Also, I'd follow the answer here: stackoverflow.com/questions/39132906/…Yuval Itzchakov

1 Answers

0
votes

Finally I sloved my problem. I changed my pom.xml as follows:

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>***</groupId>
  <artifactId>***</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <spark.version>2.2.0</spark.version>
  </properties>

    <dependencies>

  <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>${spark.scope}</scope>
  </dependency>

  <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>${spark.scope}</scope>
  </dependency>
  <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
  </dependency>

  </dependencies>

<profiles>
    <profile>
        <id>default</id>
        <properties>
            <profile.id>dev</profile.id>
            <spark.scope>compile</spark.scope>
        </properties>
        <activation>
            <activeByDefault>true</activeByDefault>
        </activation>
    </profile>
    <profile>
        <id>test</id>
        <properties>
            <profile.id>test</profile.id>
            <spark.scope>provided</spark.scope>
        </properties>
    </profile>
    <profile>
        <id>online</id>
        <properties>
            <profile.id>online</profile.id>
            <spark.scope>provided</spark.scope>
        </properties>
    </profile>
</profiles>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.0.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.11</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>2.11.8</scalaVersion>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>***</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>

    </build>

</project>

Basically I added a profiles section and add scope to each dependency. Then instead of using mvn clean package I used mvn clean install -Ponline -DskipTests. And suprisingly, everything works perfect.

I am not quite clear about the details why this method work, but from the jar file I can see that the jar created by mvn clean package include lots of folders while the other method only includes a few. Maybe there are some conflict between folders in the first method. I don't know, hope some experienced people can explain this.