I am new to Spark and looking for best practices for managing dependency jars.
There are a couple of options I can think of:
- Include everything (application code and third-party jars) in a fat jar
Pros: dependencies are controlled through the Maven POM file, so the same jars we use to compile, build, and test are exactly what gets deployed to each environment (QA/Prod etc.)
Cons: since it is a fat jar, the Maven repository fills up quickly, and it takes time to build the jar and push it from the build machine to the deployment machine
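For reference, the fat-jar workflow I have in mind looks roughly like this (class name, jar name, and paths are hypothetical, and I assume the Maven Shade plugin is configured in the POM):

```shell
# Build the fat jar (assumes maven-shade-plugin is set up in pom.xml,
# ideally with Spark/Hadoop dependencies marked as <scope>provided</scope>
# so they are excluded from the shaded artifact)
mvn clean package

# Submit the single self-contained artifact; no extra classpath flags needed
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  target/my-app-1.0-shaded.jar
```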
- Only application code resides in the jar, and third-party jars are supplied separately, e.g. via `--conf spark.executor.extraClassPath=...`
Pros: the application jar is small and easy to build and push from the build environment to the target environments
Cons: may lead to inconsistency between the Maven POM dependency list and the list of jar names specified on the classpath, and we also need to make sure the versions match
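A thin-jar submit along these lines is what I mean (jar names and `/opt/libs` paths are hypothetical; the jars would already have to exist at those paths on the driver and on every executor node, since `extraClassPath` does not ship anything):

```shell
# Thin jar: only our application code is in the artifact; third-party jars
# are picked up from a path that must be pre-provisioned on all nodes
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --conf spark.driver.extraClassPath=/opt/libs/guava-27.0-jre.jar:/opt/libs/jackson-databind-2.9.8.jar \
  --conf spark.executor.extraClassPath=/opt/libs/guava-27.0-jre.jar:/opt/libs/jackson-databind-2.9.8.jar \
  target/my-app-1.0.jar
```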
We are using the Cloudera distribution with Spark 2.3.0.
Also, in both cases we should not need to include Spark and Hadoop related jars: they are available on the Spark executors by default, so there is no need to ship them to the executors every time we run a Spark application. Is that right?
How do we find out which dependency jars are available by default on the (Cloudera) Spark executors, so that we neither export them nor include them in the fat jar?
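One thing I have tried is just listing the jars directory of the Spark install and printing the JVM classpath from a shell (the parcel path below is typical for a CDH Spark2 parcel install, but may differ on other deployments):

```shell
# List the jars the distribution ships with (path is an assumption;
# adjust to where Spark is actually installed on your cluster)
ls /opt/cloudera/parcels/SPARK2/lib/spark2/jars/

# Or ask a running shell what is actually on its JVM classpath
spark-shell --master yarn <<'EOF'
println(System.getProperty("java.class.path"))
EOF
```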
Is it better to keep the third-party jars in HDFS and reference them from there on the classpath, rather than keeping the jars on the client/edge node and exporting them from there?
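What I am imagining for the HDFS variant is something like the following (jar name and `/libs` path are hypothetical); `spark-submit --jars` does accept `hdfs://` URIs, and Spark/YARN then distributes the jars to the executors:

```shell
# Stage shared libraries once in HDFS...
hdfs dfs -mkdir -p /libs
hdfs dfs -put guava-27.0-jre.jar /libs/

# ...then reference them by URI at submit time, so nothing needs to
# live on the edge node's local filesystem
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --jars hdfs:///libs/guava-27.0-jre.jar \
  target/my-app-1.0.jar
```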
Are there any best practices or recommendations? Any references are appreciated.