0 votes

I have a cluster with Hadoop 2.0.0-cdh4.4.0, and I need to run Spark on it with YARN as the resource manager. I got the following information from http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version:

You can enable the yarn profile and optionally set the yarn.version property if it is different from hadoop.version. Spark only supports YARN versions 2.2.0 and later.
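My understanding of that documentation is that the Maven build would look roughly like the following (the exact profile name and version numbers depend on the Spark release, so this is just a sketch):

    # Build Spark with YARN support against a specific Hadoop/YARN version (versions here are examples)
    mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package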

I don't want to upgrade the whole Hadoop package just to get YARN 2.2.0: my HDFS holds a massive amount of data, and upgrading it would cause too long a service interruption and be too risky for me.

I think the best choice for me may be to use a version of YARN higher than 2.2.0 while keeping the rest of my Hadoop installation unchanged. If that's the way to go, what steps should I follow to obtain such a YARN package and deploy it on my cluster?

Or is there another approach to running Spark on Hadoop 2.0.0-cdh4.4.0 with YARN as the resource manager?
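For context, once this is working I would expect to submit jobs to YARN with something along these lines (the class name, jar path and resource settings below are just placeholders):

    spark-submit --class org.example.MyApp \
      --master yarn-cluster \
      --num-executors 4 \
      --executor-memory 2g \
      myapp.jar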


1 Answer

0 votes

While you could theoretically upgrade just your YARN component, my experience suggests that you run a large risk of library and other component incompatibilities if you do. Hadoop consists of a lot of components, but they're generally not as decoupled as they should be, which is one of the main reasons CDH, HDP and other Hadoop distributions bundle only specific versions known to work together. And if you have commercial support but change the version of one component, the vendor generally won't support you, because things tend to break when you do this.

In addition, CDH4 reached End of Maintenance last year and is no longer being developed by Cloudera, so if you hit a problem you're going to find it hard to get fixes (generally you'll just be told to upgrade to a newer version). I can also say from experience that if you want to use newer versions of Spark (e.g. 1.5 or 1.6), you also need a newer version of Hadoop (be it CDH, HDP or another distribution): Spark has evolved very quickly, and YARN support was bolted on later, so earlier versions of both Hadoop and Spark have plenty of bugs and issues.

Sorry, I know it's not the answer you're looking for, but upgrading Hadoop to a newer version is probably the only way forward if you actually want things to work and don't want to spend a lot of time debugging version incompatibilities.