3
votes

I followed this link to install Spark in Standalone mode on a cluster by placing pre-built versions of Spark on each node of the cluster and running ./sbin/start-master.sh on the master and ./sbin/start-slave.sh <master-spark-URL> on each slave. How do I continue from there to set up a PySpark application, for example in an IPython notebook, so that it utilizes the cluster? Do I need to install IPython on my local machine (laptop)?
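For context, this is roughly what I ran (spark://<master-hostname>:7077 is a placeholder for the URL the master prints on startup; by default the standalone master listens on port 7077 and its web UI on 8080):

# On the master node, from the Spark installation directory
./sbin/start-master.sh

# On each slave node, pointing at the master's URL
./sbin/start-slave.sh spark://<master-hostname>:7077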

1
I followed this blog post to set up IPython to work with PySpark: ramhiser.com/2015/02/01/… If you are using Python 3 there is a slight change you'll have to make; let me know if you are interested and I can dig it up from my machine. After following the above steps you can launch IPython like this: ipython --profile=pyspark. Also, you'll have to run PySpark directly on the cluster master. As of today, it is not possible to run PySpark remotely (e.g., from your laptop) for a standalone cluster. - quantum_random
Thank you @quantum_random, does running IPython on the master automatically distribute the job across the workers? - DevEx
No, by default it does not. You can use the --master option and specify the master URL, like spark://<master-url>:7077; this tells Spark to use the entire cluster. Interestingly enough, I have not tried using IPython with the --master option, so I don't know how well it works. - quantum_random

1 Answer

2
votes

To use IPython to run PySpark, you'll need to add the following environment variables to your .bashrc:

export PYSPARK_DRIVER_PYTHON=ipython2 # As pyspark only works with python2 and not python3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

This will cause an ipython2 notebook to be launched when you execute pyspark from the shell.

Note: I assume you already have IPython Notebook installed. If not, the easiest way to get it is the Anaconda Python distribution.
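Putting this together with the --master suggestion from the comments above, a minimal sketch of the full sequence might look like this (run on the cluster master; spark://<master-url>:7077 is a placeholder for your actual standalone master URL):

# Added to ~/.bashrc on the cluster master
export PYSPARK_DRIVER_PYTHON=ipython2
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Reload the shell configuration, then launch pyspark against the standalone cluster
source ~/.bashrc
./bin/pyspark --master spark://<master-url>:7077

Launching this way starts the notebook server as the PySpark driver, and notebooks opened from it get the pre-created SparkContext (available as sc) connected to the cluster.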
