2
votes

I'm trying to use Jupyter, PySpark, and S3 files (via the s3a protocol) together. I need org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, as we need to use S3 session tokens; that provider was added in hadoop-aws 2.8.3+. I'm trying the following code:

import os
from pyspark.sql import SparkSession

# Must be set before the SparkSession (and hence the driver JVM) is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:3.0.0 pyspark-shell'
spark = SparkSession.builder.appName('abc2').getOrCreate()
sc = spark.sparkContext
# Minimal check that the provider class is visible to the driver JVM.
res = sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")

This is failing with

Py4JJavaError: An error occurred while calling z:java.lang.Class.forName.
: java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

However, this class definitely exists in hadoop-aws 3.0.0.

The spark conf shows:

[('spark.driver.port', '34723'),
 ('spark.executor.id', 'driver'),
 ('spark.driver.host', 'HeartyX'),
 ('spark.jars',
  'file:///home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,file:///home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar'),
 ('spark.submit.pyFiles',
  '/home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,/home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar'),
 ('spark.repl.local.jars',
  'file:///home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,file:///home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.app.id', 'local-1542373156862'),
 ('spark.master', 'local[*]'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.name', 'abc2'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.files',
  'file:///home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,file:///home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar')]

So the jars are getting submitted.

On a standalone spark-without-hadoop (2.3.1) cluster with a Hadoop 3.0.0 install, this works perfectly when run via spark-submit on the command line. In Jupyter notebooks, however, the required class isn't found, so the code above (and any code that tries to read data from s3a://bucket/prefix/key) fails.

Any idea why the --packages jars aren't visible in Jupyter?
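
For context, the configuration I'm ultimately after is roughly the following sketch (the fs.s3a.* keys are the standard S3A options; the credential values and the path are placeholders):

from pyspark.sql import SparkSession

# Placeholder credentials; in practice these come from an STS session.
spark = (SparkSession.builder
         .appName('abc2')
         .config('spark.hadoop.fs.s3a.aws.credentials.provider',
                 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
         .config('spark.hadoop.fs.s3a.access.key', '<access-key>')
         .config('spark.hadoop.fs.s3a.secret.key', '<secret-key>')
         .config('spark.hadoop.fs.s3a.session.token', '<session-token>')
         .getOrCreate())

# Placeholder read, just to exercise the s3a filesystem.
df = spark.read.text('s3a://bucket/prefix/key')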

UPDATE

So I tried simplifying: I created a conda env and installed pyspark 2.4.0 (Python 3) via pip. Then I tried:

pyspark --packages org.apache.hadoop:hadoop-aws:3.0.0

In the launched shell, I tried the code above. On startup I can see it download the jars, but it still doesn't find the class.
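
A quick sketch for checking what actually ends up on the driver JVM's classpath (run from the pyspark shell, where sc is already defined), in case that helps:

# Ask the driver JVM, via py4j, what is actually on its classpath;
# if the hadoop-aws jar isn't listed, Class.forName can't succeed.
cp = sc._jvm.java.lang.System.getProperty('java.class.path')
print([p for p in cp.split(':') if 'hadoop-aws' in p])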

UPDATE 2

So I copied the jars manually into /home/ashic/.conda/envs/pyspark/lib/python3.7/site-packages/pyspark/jars and ran pyspark on the command line. It "just worked". However, putting the jars into a folder and pointing at them with --driver-class-path, or even --jars, does not work. It looks like pyspark is not putting the jars on the classpath as expected.
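
In case it helps anyone reproduce this, a small sketch for locating the bundled jars directory of a pip-installed pyspark (the folder I copied the jars into):

import os
import pyspark

# pip-installed PySpark ships its own Spark distribution; anything dropped
# into this directory lands on the driver's classpath at launch.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), 'jars')
print(jars_dir)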

I'm setting the env var before starting the Spark session, and the context config shows the relevant jars in the list. So I'd say those are getting supplied. I'm wondering if the driver doesn't get access to stuff in packages (i.e. maybe only the executors do). – ashic

Which jars did you add to the folder? Was it just hadoop-aws-3.0.0.jar, or do I need all the hadoop-*-3.0.0.jar files? – sid-kap

1 Answer

1
votes

Mixing JARs across Hadoop versions is doomed to failure. Even once the hadoop-* JARs line up, you'll discover version problems. Getting classpaths right is one of the eternal pain points of the entire ASF big data stack.

The easiest way is probably to copy the AWS class into your own library, fix it up until it works, and run it against Hadoop 2.8.

You'll probably need to replace calls to S3AUtils.lookupPassword(conf, key, ...) with conf.getTrimmed(key, "") and it'll pick up the session secrets; the lookupPassword code is a bit more complex because it is designed to handle secrets tucked away in encrypted JCEKS files.
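
Once you've built that patched provider into its own JAR, wiring it in from PySpark would look roughly like the sketch below; the jar path and the com.example.SessionTokenCredentialsProvider class name are placeholders for whatever you end up calling your copy:

from pyspark.sql import SparkSession

# Sketch only: the jar path and provider class name are placeholders for
# the library you build by copying and fixing up the AWS provider class.
spark = (SparkSession.builder
         .config('spark.jars', '/path/to/session-creds-provider.jar')
         .config('spark.hadoop.fs.s3a.aws.credentials.provider',
                 'com.example.SessionTokenCredentialsProvider')
         .getOrCreate())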