Been playing with pyspark on Jupyter all day with no issues. Simply by using the docker image jupyter/pyspark-notebook, 90% of everything I need is packaged (YAY!)
I would like to start exploring GraphFrames, which sits on top of GraphX, which in turn sits on top of Spark. Has anyone gotten this combination to work?
Essentially, according to the documentation, I just need to pass "--packages graphframes:xxyyzz" when running pyspark to download and run GraphFrames. The problem is that Jupyter is already running by the time the container comes up.
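For reference, this is the sort of invocation the GraphFrames documentation describes for a normal, non-Docker setup (the exact version coordinate depends on your Spark and Scala builds; the one below is just an example):

    # What the GraphFrames docs suggest outside of Docker.
    # The version string is only an illustrative example.
    pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12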
I've tried passing the "--packages" argument as an environment variable (-e) for both JUPYTER_SPARK_OPTS and SPARK_OPTS when running docker run, and that didn't work. I found that I can run pip install graphframes from a terminal, which gets me part of the way: the Python libraries are installed, but the Java ones are not ("java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI").
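Concretely, my attempts looked roughly like this (reconstructed from memory; the version coordinate is again just a placeholder):

    # Attempt 1: pass the packages flag via environment variables.
    # Neither variable seemed to have any effect.
    docker run -it --rm -p 8888:8888 \
      -e JUPYTER_SPARK_OPTS="--packages graphframes:graphframes:0.8.2-spark3.2-s_2.12" \
      -e SPARK_OPTS="--packages graphframes:graphframes:0.8.2-spark3.2-s_2.12" \
      jupyter/pyspark-notebook

    # Attempt 2: from a terminal inside the running container.
    # This installs the Python wrapper but not the underlying jars,
    # hence the ClassNotFoundException above.
    pip install graphframes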
The image-specifics documentation does not appear to offer any insight into how to deploy a Spark package to the image.
Is there a particular place to drop the graphframes .jar? Is there a command to install a Spark package after the container is already running? Is there a magic argument to docker run that would install this?
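I was half expecting something like the following to work from a notebook cell. spark.jars.packages is a standard Spark config key, but I don't know whether the image's kernel setup honors it, so treat this as a sketch of what I hoped for rather than something I've verified:

    # Sketch: ask Spark to resolve the GraphFrames jars when the
    # session is created. The version coordinate is just an example.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("graphframes-test")
        .config("spark.jars.packages",
                "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
        .getOrCreate()
    )

    # If the jars were actually resolved, using GraphFrame should no
    # longer raise java.lang.ClassNotFoundException.
    from graphframes import GraphFrame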
I bet there's a really simple answer to this -- or am I in high cotton here?