
I am new to Spark and looking for best practices for managing dependency jars.

There are a couple of options I can think of:

  1. Include everything (application and third-party jars) in a fat jar

Pros: Dependencies are controlled through the Maven POM file, so the same dependency jars we use to compile, build, and test are what go to the different environments (QA/Prod, etc.)

Cons: Since it is a fat jar, the Maven repository fills up, it takes time to build and push the jar from the build machine to the deployment machine, etc.

  2. Only application-related code resides in the jar, and third-party jars are exported as --conf spark.executor.extraClassPath=

Pros: The application jar is small and easy to build and push from the build to the target environments

Cons: May lead to inconsistency between the Maven POM dependency list and the list of jar names specified on the classpath; we also need to make sure the versions match

We are using the Cloudera distribution with Spark 2.3.0.

Also, in both cases we do not need to include Spark and Hadoop related jars, since they are available by default on the Spark executors, so there is no need to ship them to the executors every time we run a Spark application. Is that right?

How do we know which dependency jars are available by default in the (Cloudera) Spark executors, so that we do not export them or include them in the fat jar?

Is it better to keep third-party jars in HDFS and export them onto the classpath, instead of keeping the jars on the client/edge node and exporting them from there?

Are there any best practices or recommendations? Any reference is appreciated.


1 Answer


There are many questions here, but I will try to cover them all.

Also, in both cases we do not need to include Spark and Hadoop related jars, since they are available by default on the Spark executors, so there is no need to ship them to the executors every time we run a Spark application. Is that right?

All the jars needed for Hadoop etc. are included in the Cloudera repositories on each node, so you do not need to copy them or include them in your spark-submit. The only thing you may have to do is define SPARK_HOME with the proper path for your Cloudera distribution (Cloudera also distinguishes between Spark 1.6 and Spark 2.0+, so make sure you use the correct SPARK_HOME).

For example, for CM 5.10 the Spark home to export for Spark 2 is:

export SPARK_HOME="/cloudera/parcels/SPARK2/lib/spark2"

How do we know which dependency jars are available by default in the (Cloudera) Spark executors, so that we do not export them or include them in the fat jar?

You can go to the corresponding directory in Cloudera where the jars are kept. You can check the existing jars with:

ls /cloudera/parcels/SPARK2/lib/spark2/jars
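For instance, to check whether a specific library is already provided by the cluster, you can filter that listing ("jackson" below is just an example search term, not a library from the question):

```shell
# Check whether a given dependency is already shipped with Spark
# ("jackson" here is only an example search term).
ls /cloudera/parcels/SPARK2/lib/spark2/jars | grep -i jackson
```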

There is always the simple option of just running something; if a jar is missing, you will see it in the execution error.

Is it better to keep third-party jars in HDFS and export them onto the classpath, instead of keeping the jars on the client/edge node and exporting them from there?

It is almost always a bad idea to add jars to the default classpath, because that area requires root access: you would have to ask your admins to add the files there (which slows things down), and I am not sure what happens when you upgrade to a later version. Instead, you can create an extra directory on all nodes to store the extra jars your application needs, and write a simple sftp/scp script to distribute the jars to that path on all machines.
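A minimal sketch of such a distribution script is below; the host names and directories are examples you would replace with your own:

```shell
#!/bin/bash
# Sketch of a jar-distribution script. HOSTS, SRC_DIR and DEST_DIR
# are example values, not taken from the original question.
HOSTS="node1 node2 node3"
SRC_DIR="/home/myuser/extra_jars"
DEST_DIR="/my_extra_jars_path"

for host in $HOSTS; do
  # Create the target directory and copy every jar to each node
  ssh "$host" "mkdir -p $DEST_DIR"
  scp "$SRC_DIR"/*.jar "$host:$DEST_DIR/"
done
```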

Then add to your conf/spark-defaults.conf:

spark.driver.extraClassPath /my_extra_jars_path/* 
spark.executor.extraClassPath /my_extra_jars_path/*

or add the --jars option to your spark-submit with the full paths of all jars, comma-separated.
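For example, a submit command using --jars could look like this (the class name and jar names are placeholders):

```shell
# Example spark-submit passing third-party jars explicitly.
# com.example.MyApp, libA.jar and libB.jar are placeholder names.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --jars /my_extra_jars_path/libA.jar,/my_extra_jars_path/libB.jar \
  my-application.jar
```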

The alternative of storing the extra jars in HDFS would be very nice, but I have not used it myself.
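If you do go that route, --jars also accepts HDFS URLs, so a sketch could look like this (the HDFS path and jar names are placeholders):

```shell
# Sketch: upload the jars to HDFS once, then reference them by URL
# at submit time. /libs and the jar names are placeholder values.
hdfs dfs -mkdir -p /libs
hdfs dfs -put /my_extra_jars_path/*.jar /libs/
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --jars hdfs:///libs/libA.jar,hdfs:///libs/libB.jar \
  my-application.jar
```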

Between the two options, I would advocate not including all dependencies in the jar (which means slow build and distribution times), but instead keeping a light jar with only the relevant code and managing dependencies with a simple sftp distribution script that copies the jars to the dedicated directory on all nodes (or into HDFS, if that is possible).