Is there a way to use PySpark with Hadoop 2.8+?

Question

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.

PySpark versions seem to be aligned with Spark versions.

Here I use PySpark 2.4.5 which seems to wrap a Spark 2.4.5.

When submitting my PySpark Job, using spark-submit --local[4] ..., with the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql

With the following java exceptions:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

Or:

java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init (Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation

I suppose that the Pyspark Job Hadoop version is unaligned with the one I pass to the spark-submit option spark.jars.packages.

But I have not any idea of how I could make it work? :)

Is the hadoop cluster also local? If it is, great. If not, the the philosophy is "bring the computation to the data" and not "bring the data to the computation" — Z4-tier

Artem Aliev Artem Aliev · Accepted Answer · 2020-03-24T15:46:21

Default spark disto has hadoop libraries included. Spark use system (its own) libraries first. So you should either set --conf spark.driver.userClassPathFirst=true and for cluster add --conf spark.executor.userClassPathFirst=true or download spark distro without hadoop. Probably you will have to put your hadoop distro into spark disto jars directory.

Is there a way to use PySpark with Hadoop 2.8+?

2 Answers