24
votes

I am new to Spark and I am trying to install PySpark by referring to the site below.

http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

I tried installing both the prebuilt package and also building the Spark package through SBT.

When I try to run Python code in the IPython Notebook I get the error below.

    NameError                                 Traceback (most recent call last)
    <ipython-input-1-f7aa330f6984> in <module>()
          1 # Check that Spark is working
    ----> 2 largeRange = sc.parallelize(xrange(100000))
          3 reduceTest = largeRange.reduce(lambda a, b: a + b)
          4 filterReduceTest = largeRange.filter(lambda x: x % 7 == 0).sum()
          5 

    NameError: name 'sc' is not defined

In the command window I can see the below error.

    Failed to find Spark assembly JAR.
    You need to build Spark before running this program.

Note that I got a Scala prompt when I executed the spark-shell command.

Update:

With the help of a friend I was able to fix the issue related to the Spark assembly JAR by correcting the contents of the .ipython/profile_pyspark/startup/00-pyspark-setup.py file.

Now only the problem of the SparkContext variable remains. I am changing the title to appropriately reflect my current issue.
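
For reference, a 00-pyspark-setup.py along the lines of the linked tutorial looks roughly like the sketch below. This is an assumption based on that tutorial rather than my exact file; the py4j zip name depends on your Spark version, so adjust the paths to your install.

import os
import sys

# Locate the Spark installation via SPARK_HOME
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put the PySpark and py4j libraries on the Python path
# (the py4j zip name varies between Spark versions)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.8.2.1-src.zip'))

# Run pyspark/shell.py, which creates the SparkContext `sc` for the notebook (Python 2 era)
execfile(os.path.join(spark_home, 'python', 'pyspark', 'shell.py'))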


13 Answers

48
votes

You need to do the following after you have pyspark on your path:

from pyspark import SparkContext
sc = SparkContext()
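
With sc defined, the snippet from the question should then run, e.g.:

largeRange = sc.parallelize(xrange(100000))
print(largeRange.reduce(lambda a, b: a + b))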
13
votes

One solution is adding pyspark-shell to the shell environment variable PYSPARK_SUBMIT_ARGS:

export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

There is a change in python/pyspark/java_gateway.py, which requires that PYSPARK_SUBMIT_ARGS include pyspark-shell if a PYSPARK_SUBMIT_ARGS variable is set by a user.
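
If you would rather set it from inside Python than in the shell, a minimal sketch (it must run before the SparkContext launches the Java gateway) would be:

import os

# Assumed equivalent of the export line above; must be set before SparkContext is created
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

from pyspark import SparkContext
sc = SparkContext()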

12
votes

You have to create an instance of SparkContext like the following:

import:

from pyspark import SparkContext

and then:

sc = SparkContext.getOrCreate()

NB: sc = SparkContext.getOrCreate() works better than sc = SparkContext().
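
A quick sketch of the difference: a second plain SparkContext() raises an error because only one context can be active at a time, while getOrCreate() simply returns the existing one.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()   # creates a context if none exists
sc2 = SparkContext.getOrCreate()  # returns the same context instead of raising
print(sc is sc2)                  # True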

5
votes

Just a little improvement. Add the following at the top of your Python script file.

#! /bin/python
from pyspark import SparkContext, SparkConf
sc = SparkContext()

# your code starts here
4
votes

This worked for me in Spark version 2.3.1:

from pyspark import SparkContext
sc = SparkContext()
2
votes

I added the lines below, provided by Venu.

from pyspark import SparkContext
sc = SparkContext()

Then the subsequent error below was resolved by removing the environment variable PYSPARK_SUBMIT_ARGS.

    C:\Spark\spark-1.3.1-bin-hadoop2.6\python\pyspark\java_gateway.pyc in launch_gateway()
         77     callback_socket.close()
         78     if gateway_port is None:
    ---> 79         raise Exception("Java gateway process exited before sending the driver its port number")
         80 
         81     # In Windows, ensure the Java child processes do not linger after Python has exited.

    Exception: Java gateway process exited before sending the driver its port number
2
votes

I also encountered the "Java gateway process exited before sending the driver its port number" error message.

I could solve that problem by downloading one of the versions that are prebuilt for Hadoop (I used the one for Hadoop 2.4). As I do not use Hadoop, I have no idea why this changed anything, but it now works flawlessly for me...

2
votes

I was getting a similar error trying to get PySpark working via PyCharm, and I noticed in the log that just before this error I was getting this one:

env: not found

I traced this down to the fact that I did not have a JAVA_HOME environment variable set, so I added os.environ['JAVA_HOME'] = "/usr/java/jdk1.7.0_67-cloudera"

to my script (I am aware that this is probably not the best place for it), and the error went away and my Spark object was created.
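
In script form that looks roughly like the following; the JDK path is just the one from my machine, so substitute your own:

import os

# Path to the JDK on my machine; replace with your own JAVA_HOME
os.environ['JAVA_HOME'] = "/usr/java/jdk1.7.0_67-cloudera"

from pyspark import SparkContext
sc = SparkContext()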

1
votes

Spark on my Mac is 1.6.0, so just adding pyspark-shell did not solve the problem. What worked for me is following the answer given here by @karenyng:

import os

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
1
votes

While working in an IBM Watson Studio Jupyter notebook I faced a similar issue; I solved it with the following:

!pip install pyspark
from pyspark import SparkContext
sc = SparkContext()
0
votes

I had the same problem. In my case, the problem was that another notebook was running (in recent versions they are shown in green). I selected and shut down one of them and it worked fine.

Sorry for reviving an old thread, but it may help someone :)

0
votes

This script worked for me (in linux):

#!/bin/bash

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="--pylab -c 'from pyspark import SparkContext; sc=SparkContext()' -i"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

pyspark

Calling pyspark like that assumes that the "spark/bin" installation path is in the PATH variable. If not, call /path/to/spark/bin/pyspark instead.

0
votes

For the Exception: Java gateway process exited before sending the driver its port number

You need to install Java 8 on your computer.