0 votes

I have been using PySpark (with Python 2.7) in an IPython notebook on Ubuntu 14.04 quite successfully by creating a special profile for Spark and starting the notebook with $ ipython notebook --profile spark. The mechanism for creating the Spark profile is given on many websites, but I have used the one given here.

The $HOME/.ipython/profile_spark/startup/00-pyspark-setup.py file contains the following code:

import os
import sys
# Configure the environment
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/osboxes/spark16'
# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']
# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
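As a side note, the python/build directory that this startup file adds typically exists only in source builds of Spark; with a binary Spark 1.6 distribution, py4j ships as a zip under python/lib instead. A hedged variant that handles both layouts (globbing so the exact py4j version does not matter) could look like this:

import glob
import os
import sys

# Configure the environment (same default path as above)
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/osboxes/spark16'
SPARK_HOME = os.environ['SPARK_HOME']

# Add PySpark itself to the Python path
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
# Binary distributions keep py4j in a zip under python/lib
# (e.g. py4j-0.9-src.zip for Spark 1.6); glob for it so the
# version number does not have to be hard-coded
for zip_path in glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*-src.zip")):
    sys.path.insert(0, zip_path)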

I have just created a new Ubuntu 16.04 VM for my students, where I want them to run PySpark programs in an IPython notebook. Python and PySpark are working quite well. We are using Spark 1.6.

However, I have discovered that the current versions of IPython notebook (or Jupyter notebook), whether downloaded through Anaconda or installed with sudo pip install ipython, DO NOT SUPPORT the --profile option, and all configuration parameters have to be specified in the ~/.jupyter/jupyter_notebook_config.py file.

Can someone please help me with the config parameters that I need to put into this file? Or is there an alternative solution? I have tried the findspark approach explained here but could not make it work. findspark got installed, but findspark.init() failed, possibly because it was written for Python 3.
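For reference, a minimal findspark session looks like the sketch below (my own example; the path is the one from the startup script above). findspark.init() accepts an explicit Spark home, which avoids relying on environment auto-detection and may get around the failure:

import findspark
# passing the path explicitly avoids relying on SPARK_HOME auto-detection
findspark.init('/home/osboxes/spark16')

import pyspark
sc = pyspark.SparkContext(appName="findspark-test")
print(sc.parallelize([1, 2, 3]).count())  # expect 3
sc.stop()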

My challenge is that everything works just fine on my old installation of IPython on my machine, but my students, who are installing everything from scratch, cannot get PySpark going on their VMs.


3 Answers

1 vote

I work with Spark locally, just for test purposes, and launch it from ~/apps/spark-1.6.2-bin-hadoop2.6/bin/pyspark with:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"   ~/apps/spark-1.6.2-bin-hadoop2.6/bin/pyspark
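To avoid typing the variables on every launch, one option (my addition, not something the answer states) is to export them in ~/.bashrc, so that a plain pyspark call always opens a notebook:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"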
0 votes

I have found a ridiculously simple answer to my own question by looking at the advice given on this page.

Forget about all the configuration files. Simply start the notebook with this command:

$ IPYTHON_OPTS="notebook" pyspark

That's all.

Obviously, the paths to Spark have to be set as given here, and if you get an error with Py4j, then look at this page.

With this you are good to go. The Spark context is already available as sc, so don't create another one.
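As a quick sanity check (my own example, not part of the original answer), you can exercise the predefined context in the first notebook cell:

# sc is created for you by the pyspark launcher; just use it
rdd = sc.parallelize(range(100))
print(rdd.sum())   # 0 + 1 + ... + 99 = 4950
print(sc.version)  # should match your Spark release

Note that Spark 2.x removed IPYTHON_OPTS; on newer versions, use PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" as in the first answer.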

0 votes

With Python 2.7.13 from Anaconda 4.3.0 and Spark 2.1.0 on Ubuntu 16.04:

$ cd
$ gedit .bashrc

Add the following lines (where "*****" is the proper path):

export SPARK_HOME=*****/spark-2.1.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
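Note that the py4j zip name is tied to the Spark release (py4j-0.10.4-src.zip is what Spark 2.1.0 ships); if you are on a different Spark version, check the actual file name first:

$ ls $SPARK_HOME/python/lib/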

Save, then do:

$ *****/anaconda2/bin/pip install py4j
$ cd
$ source .bashrc

Check if it works with:

$ ipython
In [1]: import pyspark
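If the import succeeds, a short follow-up test (my own addition) confirms that a context can actually be created:

In [2]: from pyspark import SparkContext
In [3]: sc = SparkContext(appName="smoke-test")
In [4]: sc.parallelize([1, 2, 3]).count()
Out[4]: 3
In [5]: sc.stop()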

For more details, go here.