1 vote

I have Apache Toree installed following the instructions at https://medium.com/@faizanahemad/machine-learning-with-jupyter-using-scala-spark-and-python-the-setup-62d05b0c7f56.

However, I cannot import packages in the PySpark kernel by setting the PYTHONPATH variable in the kernel file at:

/usr/local/share/jupyter/kernels/apache_toree_pyspark/kernel.json.

Using the notebook I can see the required .zip in sys.path and in os.environ['PYTHONPATH'], and the relevant .jar is in os.environ['SPARK_CLASSPATH'], but I get

"No module named graphframe" when importing it with import graphframe.
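
For reference, the check I run in the notebook looks roughly like this (the module and variable names are exactly what I described above):

import os
import sys

# The graphframes .zip shows up both on sys.path and in PYTHONPATH,
# and the .jar is listed in SPARK_CLASSPATH ...
print([p for p in sys.path if "graphframes" in p])
print(os.environ["PYTHONPATH"])
print(os.environ["SPARK_CLASSPATH"])

import graphframe  # ... yet this still fails with "No module named graphframe"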

Any suggestions on how to get graphframes imported?

Thank you.


2 Answers

1 vote

I was using the .zip from the graphframes download page, but that does not solve the problem. The correct .zip can be created by following the steps in:

https://github.com/graphframes/graphframes/issues/172

Another solution is given in Importing PySpark packages, although the --packages parameter didn't work for me; see the sketch below for how the .zip can be used instead.
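
Once the correct .zip exists, one way to make it importable from the notebook is sc.addPyFile (a minimal sketch; the path to the zip is illustrative):

import pyspark

sc = pyspark.SparkContext()
# Ship the zip to the executors and put it on this session's Python path;
# adjust the path to wherever you created the zip
sc.addPyFile("/path/to/graphframes.zip")

from graphframes import GraphFrame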

Hope this helps.

0 votes

The quickest way to get a package like graphframes working in a Jupyter notebook is to set the PYSPARK_SUBMIT_ARGS environment variable. In a running notebook server this can be done like this:

import os
os.environ["PYSPARK_SUBMIT_ARGS"] = ("--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell")

Verify that it was added before launching the SparkContext with sc = pyspark.SparkContext(); os.environ should now contain:

environ{...
       'PYSPARK_SUBMIT_ARGS': '--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell'}
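
With that in place, creating the context picks the package up; a minimal sketch:

import os
import pyspark

# The submit args must be set before the JVM / SparkContext is started
assert "graphframes" in os.environ.get("PYSPARK_SUBMIT_ARGS", "")

sc = pyspark.SparkContext()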

You should then find a tmp directory on the Python path. Check it via import sys; sys.path, which should contain something like this:

[...
 '/tmp/spark-<###>/userFiles-<###>/graphframes_graphframes-0.7.0-spark2.4-s_2.11.jar',
 '/usr/local/spark/python',
 '/usr/local/spark/python/lib/py4j-0.10.7-src.zip', ...
]
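
At that point the import should work. A quick smoke test (a sketch, assuming the SparkContext from above is running):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()  # reuses the existing SparkContext

# Toy graph: two vertices and one edge
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])

GraphFrame(v, e).inDegrees.show()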

This was tested with the jupyter/pyspark-notebook Docker container, for which you can also set the environment variable at build time. To do so, run docker build . with this Dockerfile:

FROM jupyter/pyspark-notebook
USER root
# Pull in graphframes whenever a SparkContext is started in the image
ENV PYSPARK_SUBMIT_ARGS --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell
USER $NB_UID
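
Any notebook started from the resulting image then has the package available as soon as a SparkContext is created, without setting the variable by hand.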