0
votes

I am hitting a library error when running pyspark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs.

On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:

keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key:Statistics.chiSqTest(value).pValue for key,value in keys_to_bucketed.iteritems()}

but if I do the same directly on the RDD, I hit issues:

keys_to_chi = vectors.mapValues(lambda vector: Statistics.chiSqTest(vector))
keys_to_chi.collectAsMap()

This results in the following exception:

Traceback (most recent call last):
  File "<ipython-input-80-c2f7ee546f93>", line 3, in chi_sq
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 238, in chiSqTest
    jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/common.py", line 127, in callMLlibFunc
    api = getattr(sc._jvm.PythonMLLibAPI(), name)
AttributeError: 'NoneType' object has no attribute '_jvm'

I had an issue early on in my Spark install where it did not see numpy, since Mac OS X has two Python installs (one from brew and one from the OS), but I thought I had resolved that. What's odd here is that this is one of the Python libs that ships with the Spark install (my previous issue had been with numpy).

  1. Install Details
    • Mac OS X Yosemite
    • Spark spark-1.4.0-bin-hadoop2.6
    • python is specified via spark-env.sh as
    • PYSPARK_PYTHON=/usr/bin/python
    • PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH:$EA_HOME/omnicat/src/main/python:$SPARK_HOME/python/
    • alias ipython-spark-notebook="IPYTHON_OPTS=\"notebook\" pyspark"
    • PYSPARK_SUBMIT_ARGS='--num-executors 2 --executor-memory 4g --executor-cores 2'
    • declare -x PYSPARK_DRIVER_PYTHON="ipython"
1
After a bit more digging, I see that basically sc (the SparkContext) is None when the exception is thrown. Does this mean the worker nodes in pyspark do not have access to the sc variable? – Anthony Brew

1 Answer

2
votes

As you've noticed in your comment, sc on the worker nodes is None. The SparkContext is only defined on the driver node, and as your traceback shows, Statistics.chiSqTest goes through callMLlibFunc, which needs sc._jvm to reach the JVM. That call therefore cannot run inside mapValues on the executors.
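
If you want the test to run on the workers, one possible workaround (a minimal sketch, assuming scipy is installed on the executor machines) is to compute the goodness-of-fit p-value with scipy.stats.chisquare, which is pure Python/NumPy and never touches the JVM:

# Minimal sketch, assuming scipy is available on the executors.
# scipy.stats.chisquare does not need the JVM-backed SparkContext
# that Statistics.chiSqTest calls into.
from scipy import stats

def chi_sq_pvalue(observed):
    # With no expected frequencies given, chisquare tests the observed
    # counts against a uniform distribution, similar to the default
    # behaviour of Statistics.chiSqTest(observed).
    return stats.chisquare(observed)[1]  # index 1 is the p-value

keys_to_chi = vectors.mapValues(chi_sq_pvalue)
keys_to_chi.collectAsMap()

Otherwise, keep collecting to the driver as in your first snippet; Statistics.chiSqTest works there because that is where the SparkContext lives.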