0 votes

I installed PySpark and IPython notebook on Ubuntu 12.04.

After installing, when I run "ipython --profile=pyspark", it throws the following exception:

ubuntu_user@ubuntu_user-VirtualBox:~$ ipython --profile=pyspark  
Python 2.7.3 (default, Jun 22 2015, 19:33:41) 
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

IPython profile: pyspark
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    173             else:
    174                 filename = fname
--> 175             __builtin__.execfile(filename, *where)

/home/ubuntu_user/.config/ipython/profile_pyspark/startup/00-pyspark-setup.py in <module>()
      6 sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
      7 
----> 8 execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
      9 

/home/ubuntu_user/spark/python/pyspark/shell.py in <module>()
     41     SparkContext.setSystemProperty("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"])
     42 
---> 43 sc = SparkContext(pyFiles=add_files)
     44 atexit.register(lambda: sc.stop())
     45 

/home/ubuntu_user/spark/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    108         """
    109         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 110         SparkContext._ensure_initialized(self, gateway=gateway)
    111         try:
    112             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/home/ubuntu_user/spark/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway)
    232         with SparkContext._lock:
    233             if not SparkContext._gateway:
--> 234                 SparkContext._gateway = gateway or launch_gateway()
    235                 SparkContext._jvm = SparkContext._gateway.jvm
    236 

/home/ubuntu_user/spark/python/pyspark/java_gateway.pyc in launch_gateway()
     92                 callback_socket.close()
     93         if gateway_port is None:
---> 94             raise Exception("Java gateway process exited before sending the driver its port number")
     95 
     96         # In Windows, ensure the Java child processes do not linger after Python has exited.


Exception: Java gateway process exited before sending the driver its port number

Below are the settings and configuration files.

ubuntu_user@ubuntu_user-VirtualBox:~$ ls /home/ubuntu_user/spark
bin          ec2       licenses  README.md
CHANGES.txt  examples  NOTICE    RELEASE
conf         lib       python    sbin
data         LICENSE   R         spark-1.5.2-bin-hadoop2.6.tgz

Below are the contents of the IPython profile directory:

ubuntu_user@ubuntu_user-VirtualBox:~$ ls .config/ipython/profile_pyspark/
db              ipython_config.py           log  security
history.sqlite  ipython_notebook_config.py  pid  startup

IPython and Spark (PySpark) configuration:

ubuntu_user@ubuntu_user-VirtualBox:~$ vi .config/ipython/profile_pyspark/ipython_notebook_config.py

# Configuration file for ipython-notebook.

c = get_config()

# IPython PySpark
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 7770


ubuntu_user@ubuntu_user-VirtualBox:~$ vi .config/ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

The following environment variables are set in .bashrc (they could equally go in .bash_profile):

ubuntu_user@ubuntu_user-VirtualBox:~$ vi .bashrc 
export SPARK_HOME="/home/ubuntu_user/spark"
export PYSPARK_SUBMIT_ARGS="--master local[2]"

I am new to Apache Spark and IPython. How do I solve this issue?


3 Answers

0 votes

I had the same exception when my virtual machine did not have enough memory for Java. Allocating more memory to the virtual machine made the exception go away.

Steps: Shut down the VM -> VirtualBox Settings -> "System" tab -> Set the memory

(However, this may only be a workaround. The proper fix is probably to configure Spark's Java memory settings appropriately.)
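For example, a minimal sketch of that approach, assuming it is the driver that needs more heap (the 1g value is arbitrary), would be to extend PYSPARK_SUBMIT_ARGS in .bashrc:

# Give the driver JVM more heap when the shell launches (example value)
export PYSPARK_SUBMIT_ARGS="--master local[2] --driver-memory 1g"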

0 votes

Maybe Spark is failing to locate the PySpark shell. Try adding the following to your environment:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

This works for Spark 1.6.1. If you have a different version, locate the py4j .zip file under $SPARK_HOME/python/lib and add that path instead, as in the sketch below.
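A minimal sketch of that lookup, assuming a single bundled py4j zip and that $SPARK_HOME is already exported:

# See which py4j zip ships with your Spark distribution
ls $SPARK_HOME/python/lib/py4j-*-src.zip

# Add PySpark and whichever py4j zip was found to PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$(ls $SPARK_HOME/python/lib/py4j-*-src.zip):$PYTHONPATH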

0 votes

Two thoughts: Where is your JDK? I don't see a JAVA_HOME variable configured in your files. That alone might explain:

Error: Must specify a primary resource (JAR or Python or R file)

Second, make sure port 7770 is open and available to your JVM.
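A minimal sketch of both checks (the JDK path below is only an example; adjust it to your actual install):

# Point JAVA_HOME at your JDK and put it on the PATH
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH

# Verify nothing else is already listening on port 7770
netstat -tln | grep 7770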