I am hitting a library error when running PySpark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs.
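For context, here is a minimal sketch of the setup; the keys and observed bucket counts are made-up placeholders, not the real data:

from pyspark.mllib.stat import Statistics

# Hypothetical RDD of (key, list(int)) pairs; each list holds observed
# counts per bucket.
vectors = sc.parallelize([
    ("feature_a", [10, 20, 30, 40]),
    ("feature_b", [25, 25, 25, 25]),
])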
On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:
keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key:Statistics.chiSqTest(value).pValue for key,value in keys_to_bucketed.iteritems()}
but if I do the same directly on the RDD, I hit issues:
keys_to_chi = vectors.mapValues(lambda vector: Statistics.chiSqTest(vector))
keys_to_chi.collectAsMap()
This results in the following exception:
Traceback (most recent call last):
File "<ipython-input-80-c2f7ee546f93>", line 3, in chi_sq
File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 238, in chiSqTest
jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/common.py", line 127, in callMLlibFunc
api = getattr(sc._jvm.PythonMLLibAPI(), name)
AttributeError: 'NoneType' object has no attribute '_jvm'
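The failing line in common.py reaches for sc._jvm, the Py4J gateway into the JVM, so sc is evidently None at that point. A quick way to check where a context actually exists (a sketch relying on the private _active_spark_context attribute that callMLlibFunc itself looks up, so this is an assumption about internals, not a public API):

from pyspark import SparkContext

def has_active_context(_):
    # In an executor's Python process no SparkContext has been started,
    # so the lookup used by callMLlibFunc comes back as None.
    return SparkContext._active_spark_context is not None

has_active_context(None)                  # True on the driver
vectors.map(has_active_context).first()   # False on a worker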
I had an issue early on in my Spark install with NumPy not being seen, since Mac OS X has two Python installs (one from Homebrew and one from the OS), but I thought I had resolved that. What's odd here is that this is one of the Python libs that ships with the Spark install (my previous issue had been with NumPy).
- Install Details
- Mac OS X Yosemite
- Spark spark-1.4.0-bin-hadoop2.6
- Python is specified via spark-env.sh as:
PYSPARK_PYTHON=/usr/bin/python
PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH:$EA_HOME/omnicat/src/main/python:$SPARK_HOME/python/
- alias ipython-spark-notebook="IPYTHON_OPTS=\"notebook\" pyspark"
- PYSPARK_SUBMIT_ARGS='--num-executors 2 --executor-memory 4g --executor-cores 2'
- declare -x PYSPARK_DRIVER_PYTHON="ipython"
sc (the Spark context) is None when the exception is thrown. Does this mean the worker nodes in PySpark do not have access to the sc variable? – Anthony Brew
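If so, a possible workaround (an untested sketch, not verified against this setup) is to compute the test with SciPy inside the transformation instead of MLlib: scipy.stats.chisquare runs entirely in the worker's Python process and needs no JVM gateway. This assumes SciPy is installed on every worker node; with no expected frequencies supplied, both it and Statistics.chiSqTest compare the observed counts against a uniform distribution.

from scipy.stats import chisquare

def chi_sq_pvalue(observed):
    # Goodness-of-fit test against uniform expected counts, mirroring
    # the default behaviour of Statistics.chiSqTest on a single vector.
    stat, p_value = chisquare(observed)
    return p_value

keys_to_chi = vectors.mapValues(chi_sq_pvalue).collectAsMap()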