Using Scala kernel with Spark

Question

I have a problem with accessing data from S3 from Spark. I have spylon-kernelinstalled for JupyterHub (which is Scala kernel with Spark framework integrtation). It uses pyspark. Unfortunately the newest pyspark still uses hadoop-2.7.3 libraries. When I'm trying to access S3 bucket in Frankfurt region I get following Java exception:

"com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: xxxxxxxxxx, AWS Error Code: null, AWS Error Message: Bad Request"

From my research it looks like it's hadoop 2.7.3 problem. With newer versions (3.1.1) it works well locally but pyspark uses those hadoop 2.7.3 jars and looks like it can't be changed. Can I do something about it? Maybe there is some way to tell pyspark to use hadoop 3.1.1 jars? Or maybe there is other Scala kernel with Spark for Jupyterhub which uses spark-shell instead of pyspark?

konradx95 konradx95 · Accepted Answer · 2020-04-25T12:19:26

Ok, I finally fixed it... I will post an answer, maybe it will be useful for someone.

pip install toree

jupyter toree install --spark_home /path/to/your/spark/ --interpreters=Scala

This one works! :)

Using Scala kernel with Spark

1 Answers