
I have created an AWS AMI that contains a local Maven repository, located in /usr/local/.

I then use that AMI to create an AWS EMR cluster with Spark and Zeppelin.

When I use pyspark --packages to import jar packages, the EMR instance creates a .ivy directory in /home/hadoop, and Zeppelin creates a directory named with an ID in /var/lib/zeppelin/local-repo.
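For example, an invocation like this (the coordinate is only a placeholder) triggers the Ivy resolution and download:

pyspark --packages groupId:artifactId:version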

How do I point pyspark, Spark, and Zeppelin to my local Maven repository (/usr/local/.m2/repository) instead of having them create a .ivy directory and download the jars from Maven Central?

I know I can use pyspark --jars /local/path/to/jar.jar to import a jar from a local path and copy it to the .ivy directory, but I would rather have Spark and Zeppelin use my local Maven repository.

Also, if I set spark.driver.extraClassPath and spark.executor.extraClassPath to /usr/local/.m2/repository/* in spark-defaults.conf, will Spark be able to find the jars in those directories? The repository does not contain .jar files directly under that path; they are nested, e.g. /usr/local/.m2/repository/groupId/artifactId/version/name.jar.
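Concretely, the spark-defaults.conf entries I have in mind are:

spark.driver.extraClassPath    /usr/local/.m2/repository/*
spark.executor.extraClassPath  /usr/local/.m2/repository/*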


1 Answer


You should be able to load dependencies dynamically in Zeppelin, like this:

%spark.dep

// add maven repository
z.addRepo("RepoName").url("RepoURL")

// add maven snapshot repository
z.addRepo("RepoName").url("RepoURL").snapshot()

// add credentials for private maven repository
z.addRepo("RepoName").url("RepoURL").username("username").password("password")

// add artifact from filesystem
z.load("/path/to.jar")

// add artifact from maven repository 
z.load("groupId:artifactId:version") 

Check the documentation for more details: https://zeppelin.apache.org/docs/latest/interpreter/spark.html#3-dynamic-dependency-loading-via-sparkdep-interpreter
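Applied to your setup, pointing the dep interpreter at the repository baked into the AMI would look roughly like the sketch below; whether the resolver accepts a file:// URL for /usr/local/.m2/repository is an assumption you would need to verify:

%spark.dep

// assumption: expose the local Maven repository on the instance as a file:// repo
z.addRepo("local-m2").url("file:///usr/local/.m2/repository")

// then load artifacts by their Maven coordinates as usual
z.load("groupId:artifactId:version")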