
I have CDH 5.7.0 with Spark 1.6.0 and Kafka 0.9.0, and I need to run a Spark Streaming job that consumes messages from a Kafka broker in another cluster running version 0.8.2.2. I create a stream like:

val stream = KafkaUtils.createStream(ssc, Utils.settings.zookeeperQuorum, Utils.settings.kafkaGroup, Utils.settings.topicMapWifi) 

In the build.sbt I'm adding:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"

With that library I would be using a client that fits a broker with version 0.8.2.x. But the problem is that Spark is loading a ton of stuff from the CDH classpath in:

/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/bin/spark-class

and it adds a newer version of the Kafka client than the one I need. Is there a way to override specific libraries from code?
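As a build.sbt sketch of one way to pin the client on the application side (the 1.6.0 artifact version matches your Spark version, unlike the 1.2.0 above; the exclusion and the kafka artifact name are the usual ones for Scala 2.10 builds, so adjust them to your build):

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0" exclude("org.apache.kafka", "kafka_2.10")

libraryDependencies += "org.apache.kafka" %% "kafka" % "0.8.2.2"

Note this only controls what ends up in your assembly jar; it does not by itself help when the cluster prepends its own classpath, which is the problem the answers below address.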


2 Answers


You can edit spark-env.sh, located under the Spark config directory (/etc/spark/conf on Cloudera), and change

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark

to point to your Spark instance. Alternatively, you can deploy your own version of Spark and use Cloudera's Hadoop configuration (set HADOOP_CONF_DIR in your spark-env.sh to /etc/hadoop/conf). In this case you will still be able to see application history, provided your application sets in its configuration

spark.eventLog.dir=hdfs:/user/spark/applicationHistory
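The alternative setup described above might look like this in the spark-env.sh of a self-deployed Spark (the SPARK_HOME path is illustrative; /etc/hadoop/conf is the standard Cloudera location):

export SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6

export HADOOP_CONF_DIR=/etc/hadoop/conf

with spark.eventLog.enabled=true and the spark.eventLog.dir shown above set in your application's configuration so the history server can pick up the logs.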

Distribute your Kafka 0.8.2.2 jars with the --jars option and set spark.driver.userClassPathFirst=true and spark.executor.userClassPathFirst=true in spark-defaults.conf; this loads your jars before the CDH classpath by using a child-first class loader.
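As a sketch, a submit command for this approach might look like the following (the class name, jar names, and paths are illustrative, not from the question):

spark-submit \
  --class com.example.MyStreamingJob \
  --jars /path/to/kafka_2.10-0.8.2.2.jar,/path/to/kafka-clients-0.8.2.2.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  my-streaming-job-assembly.jar

The two userClassPathFirst settings can also be passed as --conf flags like this instead of being placed in spark-defaults.conf; note they are marked experimental in Spark 1.6 and can cause linkage errors if your jars conflict with classes Spark itself needs.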