I'm trying to use Jupyter, PySpark, and S3 files (via the s3a protocol) together. I need org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, as we need to use S3 session tokens; that provider was added in hadoop-aws 2.8.3+. I'm trying the following code:
import os
from pyspark.sql import SparkSession

# must be set before the first SparkSession/SparkContext launches the JVM
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:3.0.0 pyspark-shell'

spark = SparkSession.builder.appName('abc2').getOrCreate()
sc = spark.sparkContext

# sanity check: can the driver JVM see the s3a credentials provider class?
res = sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
This is failing with
Py4JJavaError: An error occurred while calling z:java.lang.Class.forName.
: java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
However, this class definitely exists in hadoop-aws 3.0.0.
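For reference, this is roughly the configuration I'm ultimately trying to get to once the class loads (the fs.s3a.* key names are from the Hadoop S3A documentation; the credential values here are just placeholders):

from pyspark.sql import SparkSession

# placeholder session credentials; in reality these come from an STS session
spark = (SparkSession.builder
         .appName('abc2')
         .config('spark.hadoop.fs.s3a.aws.credentials.provider',
                 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
         .config('spark.hadoop.fs.s3a.access.key', '<access-key>')
         .config('spark.hadoop.fs.s3a.secret.key', '<secret-key>')
         .config('spark.hadoop.fs.s3a.session.token', '<session-token>')
         .getOrCreate())

# after this, reads from paths like s3a://bucket/prefix/key should work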
The spark conf shows:
[('spark.driver.port', '34723'),
('spark.executor.id', 'driver'),
('spark.driver.host', 'HeartyX'),
('spark.jars',
'file:///home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,file:///home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar'),
('spark.submit.pyFiles',
'/home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,/home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar'),
('spark.repl.local.jars',
'file:///home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,file:///home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar'),
('spark.rdd.compress', 'True'),
('spark.serializer.objectStreamReset', '100'),
('spark.app.id', 'local-1542373156862'),
('spark.master', 'local[*]'),
('spark.submit.deployMode', 'client'),
('spark.app.name', 'abc2'),
('spark.ui.showConsoleProgress', 'true'),
('spark.files',
'file:///home/ashic/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.0.0.jar,file:///home/ashic/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar')]
So the jars are getting submitted.
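As a side note, the driver JVM's base classpath can be inspected from the notebook like this (continuing from the sc above); jars pulled in via --packages go through Spark's own classloader, so they may not appear here even when spark.jars lists them:

import os

# the driver JVM is already running, so ask it for its classpath directly
cp = sc._jvm.java.lang.System.getProperty('java.class.path')
print([p for p in cp.split(os.pathsep) if 'hadoop-aws' in p or 'aws-java-sdk' in p])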
On a standalone spark-without-hadoop (2.3.1) cluster with a Hadoop 3.0.0 install, this works perfectly when run via spark-submit on the command line. However, in a Jupyter notebook it doesn't seem to find the required class, so that code (and any code that tries to read data from s3a://bucket/prefix/key) fails.
Any idea why the --packages jars aren't visible in Jupyter?
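In case the classloader matters, a variant of the check that goes through the thread context classloader instead of Class.forName looks like this; I'm not sure it behaves any differently, it's just another data point:

# try loading the class via the thread context classloader rather than Class.forName
loader = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
res = loader.loadClass("org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")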
UPDATE
So, I tried simplifying. I created a conda env and installed pyspark 2.4.0 (Python 3) through pip. Then I tried:
pyspark --packages org.apache.hadoop:hadoop-aws:3.0.0
In the launched shell, I ran the code above. On startup I can see the jars being downloaded, but the class is still not found.
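For completeness, my understanding is that spark.jars.packages is the config equivalent of --packages, so I'd expect the following to behave the same way (and presumably hit the same problem):

from pyspark.sql import SparkSession

# config-based equivalent of --packages; must be set before the JVM starts
spark = (SparkSession.builder
         .appName('abc2')
         .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.0.0')
         .getOrCreate())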
UPDATE 2
So, I copied the jars manually into /home/ashic/.conda/envs/pyspark/lib/python3.7/site-packages/pyspark/jars and ran pyspark on the command line. It "just worked". However, putting the jars into a folder and using --driver-class-path, or even --jars, does not work. It looks like pyspark is not using the jars as expected.
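For what it's worth, the programmatic counterpart of --driver-class-path that I'd expect to work is spark.driver.extraClassPath, set before the session starts; /path/to/jars below is a placeholder for wherever the downloaded jars sit:

from pyspark.sql import SparkSession

# /path/to/jars is a placeholder directory containing hadoop-aws-3.0.0.jar
# and aws-java-sdk-bundle-1.11.199.jar
spark = (SparkSession.builder
         .appName('abc2')
         .config('spark.driver.extraClassPath', '/path/to/jars/*')
         .getOrCreate())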
Did you only copy hadoop-aws-3.0.0.jar, or do I need all the hadoop-*-3.0.0.jar files? – sid-kap