
I would like to run a PySpark job locally with a specific Hadoop version (hadoop-aws 2.8.5 in my case) because I need some of its features.

PySpark versions seem to be aligned with Spark versions.

Here I use PySpark 2.4.5, which seems to bundle Spark 2.4.5.

When submitting my PySpark job with spark-submit --master local[4] ..., using the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql

With the following java exceptions:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

Or:

java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation

I suspect that the Hadoop version bundled with my PySpark job is not aligned with the one I pass via the spark-submit option spark.jars.packages.

But I have no idea how to make it work. :)
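For reference, my invocation looks roughly like this (`job.py` stands in for my actual script):

```shell
# Hypothetical local submit; the script name is a placeholder.
spark-submit \
  --master "local[4]" \
  --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5 \
  job.py
```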

2
Is the hadoop cluster also local? If it is, great. If not, the the philosophy is "bring the computation to the data" and not "bring the data to the computation"Z4-tier
@Z4-tier: Yes it is.Axel Borja

2 Answers


The default Spark distribution has the Hadoop libraries included, and Spark uses its own (system) libraries first. So you should either set --conf spark.driver.userClassPathFirst=true (and, for a cluster, also --conf spark.executor.userClassPathFirst=true), or download a Spark distribution without Hadoop. You will probably have to put your Hadoop distribution's jars into the Spark distribution's jars directory.
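A sketch of the first option with both flags set (the script name and master are placeholders):

```shell
# Make user-supplied jars win over Spark's bundled Hadoop classes.
# In local mode the driver setting is enough; the executor setting
# matters when running on a cluster.
spark-submit \
  --master "local[4]" \
  --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  job.py
```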


Ok, I found a solution:

1 - Install Hadoop in the expected version (2.8.5 for me)

2 - Install a Hadoop Free version of Spark (2.4.4 for me)

3 - Set the SPARK_DIST_CLASSPATH environment variable, to make Spark use the custom version of Hadoop.

(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
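Steps 1-3 can be sketched like this, following the linked "Hadoop provided" docs (the install paths are assumptions for illustration):

```shell
# Assumed install locations; adjust to where Hadoop 2.8.5 and the
# "without-hadoop" Spark 2.4.4 build actually live on your machine.
export HADOOP_HOME=/opt/hadoop-2.8.5
export SPARK_HOME=/opt/spark-2.4.4-bin-without-hadoop
export PATH="$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH"

# Point Spark at the external Hadoop jars, as the Spark docs describe.
export SPARK_DIST_CLASSPATH="$("$HADOOP_HOME/bin/hadoop" classpath)"
```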

4 - Add the PySpark directories to PYTHONPATH environment variable, like the following:

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

(Note that the py4j version may differ.)
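If you would rather not hard-code the py4j version, a small Python sketch can discover the bundled zip by globbing (the SPARK_HOME fallback path is an assumption):

```python
import glob
import os

# Build the PYTHONPATH entries from SPARK_HOME, globbing for the bundled
# py4j zip so the exact version number doesn't matter.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
entries = py4j_zips + [os.path.join(spark_home, "python")]
print(os.pathsep.join(entries))
```

You can append the printed value (plus the existing $PYTHONPATH) in your shell profile instead of pinning py4j-0.10.7 by hand.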

That's it.