3
votes

I am using pyspark on Ubuntu with Python 2.7. I installed it using

pip install pyspark --user 

and I am trying to follow the instructions to set up a Spark cluster.

I can't find the start-master.sh script. I assume this is because I installed pyspark and not regular Spark.

I found here that I can connect a worker node to the master via pyspark, but how do I start the master node with pyspark?

3
I don't know if pyspark downloads all of Spark, sets up Java for you, and all the other prerequisites... Did you try searching your OS disk for that file, though? - OneCricketeer
Yes, I did. Pyspark is able to connect to a master and be a worker, but how do I set up a server? - thebeancounter

3 Answers

5
votes

https://pypi.python.org/pypi/pyspark

The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
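For example, one way to fetch and unpack a full distribution from the Apache archive (the version and file name below are just placeholders; pick whatever the downloads page currently offers):

wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz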

2
votes

Well, I made a bit of a mix-up in the OP.

You need to get Spark onto the machine that should run as the master. You can download it here.

After extracting it, you will have a spark/sbin folder containing the start-master.sh script. You need to start it with the -h argument.
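A minimal sketch, assuming Spark was extracted to ~/spark and that 192.168.1.10 is a placeholder for the master machine's address:

cd ~/spark
./sbin/start-master.sh -h 192.168.1.10

By default the master then listens for workers on port 7077 and serves a web UI on port 8080.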

Please note that you need to create a spark-env file, as explained here, and define the Spark local and master variables; this is important on the master machine.
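For illustration, a Spark 2.x-style spark-env.sh along these lines should work (the IP address is the same placeholder as above):

# spark/conf/spark-env.sh (copied from spark-env.sh.template)
export SPARK_MASTER_HOST=192.168.1.10
export SPARK_LOCAL_IP=192.168.1.10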

After that, on the worker nodes, use the start-slave.sh script to start the worker processes and point them at the master.
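Again assuming 192.168.1.10 is the master's address and 7077 its default port:

cd ~/spark
./sbin/start-slave.sh spark://192.168.1.10:7077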

And you are good to go; you can now use a Spark context inside Python to run jobs on the cluster!
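For example, a minimal pyspark snippet that connects to the master started above (the address is the same placeholder as before):

from pyspark import SparkConf, SparkContext

# point the context at the standalone master
conf = SparkConf().setAppName("cluster-test").setMaster("spark://192.168.1.10:7077")
sc = SparkContext(conf=conf)

# quick sanity check: sum numbers across the cluster
print(sc.parallelize(range(1000)).sum())
sc.stop()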

1
votes

If you are already using pyspark through a conda / pip installation, there is no need to install Spark and set up the environment variables again for the cluster setup.

For conda / pip pyspark installation is missing only 'conf', 'sbin' , 'kubernetes', 'yarn' folders, You can simply download Spark and move those folders into the folder where pyspark is located (usually site-packages folder inside python).
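A rough sketch of that move, assuming the full Spark download was extracted to ~/spark-2.4.0-bin-hadoop2.7 (the version and path are placeholders):

# find where pip/conda put pyspark (the path varies by environment)
PYSPARK_DIR=$(python -c "import pyspark; print(pyspark.__path__[0])")

# copy the missing folders from the full Spark download into it
SPARK_DL=~/spark-2.4.0-bin-hadoop2.7
cp -r "$SPARK_DL"/conf "$SPARK_DL"/sbin "$SPARK_DL"/kubernetes "$SPARK_DL"/yarn "$PYSPARK_DIR"/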