
I've been playing with PySpark in Jupyter all day with no issues. Just by using the Docker image jupyter/pyspark-notebook, 90% of everything I need is packaged (YAY!)

I would like to start exploring GraphFrames, which is built on top of Spark DataFrames and provides GraphX-style graph processing on top of Spark. Has anyone gotten this combination to work?

Essentially, according to the documentation, I just need to pass "--packages graphframes:xxyyzz" when running pyspark to download and use GraphFrames. The problem is that Jupyter is already running as soon as the container comes up, so there is no opportunity to pass that flag.
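
(For reference, outside the container the documented approach would look roughly like the line below. The exact version coordinate is just the one from the answer further down, so substitute whatever is current.)

pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12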

I've tried passing the "--packages" line as an environment variable (-e) for both JUPYTER_SPARK_OPTS and SPARK_OPTS when running docker run, and that didn't work. I found that I can run pip install graphframes from a terminal, which gets me part of the way: the Python libraries are installed, but the Java ones are not ("java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI").

The image-specifics documentation does not appear to offer any insight into how to deploy a Spark package to the image.

Is there a particular place to drop the graphframes .jar? Is there a command to install a Spark package after the container is already running? Is there a magic argument to docker run that would install it?

I bet there's a really simple answer to this -- or am I in the tall weeds here?


1 Answer


So the answer was quite simple:

From the gist here, we simply need to tell Jupyter to add the --packages argument to the Spark submit arguments by putting something like this at the top of the notebook. Spark goes out and installs the package when grabbing the context:

import os
# Must be set before the SparkContext/SparkSession is created, so that
# spark-submit pulls the GraphFrames jars when the context starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 pyspark-shell'

Keep an eye on the versions available on the graphframes Spark Packages page, which for now means graphframes 0.8.1 on Spark 3.0 with Scala 2.12.
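
Once PYSPARK_SUBMIT_ARGS is set (and before any SparkContext exists in the notebook), a quick sanity check might look like the sketch below. The toy vertices and edges are just an illustration I made up, not part of the original question:

from pyspark.sql import SparkSession

# Creating the session is what triggers the --packages download configured above.
spark = SparkSession.builder.appName("graphframes-check").getOrCreate()

# Import after the session exists so the Python module shipped in the
# graphframes jar is on the path (or rely on the pip-installed package).
from graphframes import GraphFrame

# Tiny made-up graph, only to confirm the GraphFrames JVM classes resolve.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()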