24
votes

I am new to Spark and I am trying to install PySpark by referring to the site below.

http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

I tried installing both the prebuilt package and also building the Spark package through SBT.

When I try to run Python code in the IPython Notebook I get the error below.

    NameError                                 Traceback (most recent call last)
    <ipython-input-1-f7aa330f6984> in <module>()
          1 # Check that Spark is working
    ----> 2 largeRange = sc.parallelize(xrange(100000))
          3 reduceTest = largeRange.reduce(lambda a, b: a + b)
          4 filterReduceTest = largeRange.filter(lambda x: x % 7 == 0).sum()
          5 

    NameError: name 'sc' is not defined

In the command window I can see the below error.

    Failed to find Spark assembly JAR.
    You need to build Spark before running this program.

Note that I got a Scala prompt when I executed the spark-shell command.

Update:

With the help of a friend I was able to fix the issue related to the Spark assembly JAR by correcting the contents of the .ipython/profile_pyspark/startup/00-pyspark-setup.py file.

Now only the problem of the SparkContext variable remains. I am changing the title to appropriately reflect my current issue.
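
For reference, a 00-pyspark-setup.py along the lines of the linked tutorial looks roughly like the sketch below. This is an assumption based on that tutorial rather than my exact file; the py4j zip name depends on your Spark version, so adjust the paths to your install.

import os
import sys

# Locate the Spark installation via SPARK_HOME
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put the PySpark and py4j libraries on the Python path
# (the py4j zip name varies between Spark versions)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.8.2.1-src.zip'))

# Run pyspark/shell.py, which creates the SparkContext `sc` for the notebook (Python 2 era)
execfile(os.path.join(spark_home, 'python', 'pyspark', 'shell.py'))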


13 Answers

48
votes

You need to do the following after you have pyspark on your path:

from pyspark import SparkContext
sc = SparkContext()
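
With sc defined, the snippet from the question should then run, e.g.:

largeRange = sc.parallelize(xrange(100000))
print(largeRange.reduce(lambda a, b: a + b))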
13
votes

One solution is adding pyspark-shell to the shell environment variable PYSPARK_SUBMIT_ARGS:

export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

There is a change in python/pyspark/java_gateway.py, which requires that PYSPARK_SUBMIT_ARGS include pyspark-shell if a PYSPARK_SUBMIT_ARGS variable is set by a user.
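
If you would rather set it from inside Python than in the shell, a minimal sketch (it must run before the SparkContext launches the Java gateway) would be:

import os

# Assumed equivalent of the export line above; must be set before SparkContext is created
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

from pyspark import SparkContext
sc = SparkContext()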

12
votes

You have to create an instance of SparkContext like the following:

import:

from pyspark import SparkContext

and then:

sc = SparkContext.getOrCreate()

NB: sc = SparkContext.getOrCreate() works better than sc = SparkContext().
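
A quick sketch of the difference: a second plain SparkContext() raises an error because only one context can be active at a time, while getOrCreate() simply returns the existing one.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()   # creates a context if none exists
sc2 = SparkContext.getOrCreate()  # returns the same context instead of raising
print(sc is sc2)                  # True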

5
votes

Just a little improvement. Add the following at the top of your Python script file.

#! /bin/python
from pyspark import SparkContext, SparkConf
sc = SparkContext()

# your code starts here
4
votes

This worked for me in Spark version 2.3.1:

from pyspark import SparkContext
sc = SparkContext()
2
votes

I added the lines below, provided by Venu.

from pyspark import SparkContext
sc = SparkContext()

Then the subsequent error below was resolved by removing the environment variable PYSPARK_SUBMIT_ARGS.

    C:\Spark\spark-1.3.1-bin-hadoop2.6\python\pyspark\java_gateway.pyc in launch_gateway()
         77     callback_socket.close()
         78     if gateway_port is None:
    ---> 79         raise Exception("Java gateway process exited before sending the driver its port number")
         80 
         81     # In Windows, ensure the Java child processes do not linger after Python has exited.

    Exception: Java gateway process exited before sending the driver its port number
2
votes

I also encountered the "Java gateway process exited before sending the driver its port number" error message.

I could solve that problem by downloading one of the versions that are prebuilt for Hadoop (I used the one for Hadoop 2.4). As I do not use Hadoop, I have no idea why this changed anything, but it now works flawlessly for me...

2
votes

I was getting a similar error trying to get PySpark working via PyCharm, and I noticed in the log that just before this error I was getting this one:

env: not found

I traced this down to the fact that I did not have a JAVA_HOME environment variable set, so I added os.environ['JAVA_HOME'] = "/usr/java/jdk1.7.0_67-cloudera"

to my script (I am aware that this is probably not the best place for it), and the error went away and my Spark object was created.
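
In script form that looks roughly like the following; the JDK path is just the one from my machine, so substitute your own:

import os

# Path to the JDK on my machine; replace with your own JAVA_HOME
os.environ['JAVA_HOME'] = "/usr/java/jdk1.7.0_67-cloudera"

from pyspark import SparkContext
sc = SparkContext()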

1
votes

Spark on my Mac is 1.6.0, so just adding pyspark-shell did not solve the problem. What worked for me is following the answer given here by @karenyng:

import os

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
1
votes

While working in an IBM Watson Studio Jupyter notebook I faced a similar issue; I solved it with the following:

!pip install pyspark
from pyspark import SparkContext
sc = SparkContext()
0
votes

I had the same problem. In my case, the problem was that another notebook was running (in recent versions they are shown in green). I selected and shut down one of them and it worked fine.

Sorry for reviving an old thread, but it may help someone :)

0
votes

This script worked for me (in linux):

#!/bin/bash

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="--pylab -c 'from pyspark import SparkContext; sc=SparkContext()' -i"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

pyspark

Calling pyspark like that assumes that the "spark/bin" installation path is in the PATH variable. If not, call /path/to/spark/bin/pyspark instead.

0
votes

For the Exception: Java gateway process exited before sending the driver its port number

You need to install Java 8 on your computer.