
I am having trouble making a simple 'hello world' connection between pyspark and mongoDB (see example I am trying to emulate https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python). Can someone please help me understand and fix this issue?

Details:

I can successfully start the pyspark shell with the --jars, --conf, and --py-files options shown below, then import pymongo_spark, and finally connect to the DB; however, when I try to print the first record, Python fails while extracting files with a permission-denied error on '/home/ .cache'. I don't think our environment settings are correct, and I am not sure how to fix this...

(see attached error file screenshot)

My Analysis: It is not clear whether this is a Spark/HDFS, pymongo_spark, or PySpark issue. Spark or pymongo_spark seems to default the egg cache to each node's /home .cache.
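For context, the cache directory the error complains about comes from setuptools' pkg_resources, which extracts .egg files to a per-user directory unless the PYTHON_EGG_CACHE environment variable overrides it. A quick way to see which directory a given node would use (a sketch; run it in the same Python environment the executors use):

```python
# Print the directory pkg_resources will use to extract eggs.
# If PYTHON_EGG_CACHE is set, it takes precedence over the
# per-user default.
import pkg_resources

print(pkg_resources.get_default_cache())
```

Running this as the same user the YARN containers run under should show whether the default points at an unwritable location.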

Here is my pyspark environment:

pyspark --jars mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar,mongo-java-driver-3.6.3.jar --driver-class-path mongo-java-driver-3.6.3.jar,mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar --master yarn-client --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" --py-files pymongo_spark.py

In [1]: import pymongo_spark

In [2]: pymongo_spark.activate()

In [3]: mongo_rdd = sc.mongoRDD('mongodb://xx.xxx.xxx.xx:27017/test.restaurantssss')

In [4]: print(mongo_rdd.first())

(error message screenshots 1-3 attached)

Can you add the bottom part of the stack trace? There seems to be no config to set that path; maybe looking at the class sources will give a hint... – ernest_k

Hi Ernest, posting it now. – Brian JButcher

Thanks again, Ernest. I posted the rest of the stack trace (error message). Does this help? The message is duplicated many times; I just grabbed the first two duplicates. Please let me know what you think or if you need any more info. – Brian JButcher

I see in the stack trace a clear message reading "Change your egg cache to point to a different directory by setting the PYTHON_EGG_CACHE environment variable to point to an accessible directory." Can you try this on your cluster nodes? – ernest_k

1 Answer


We knew about the advice to "change your egg cache to point to a different directory by setting the PYTHON_EGG_CACHE environment variable to point to an accessible directory", but we were unsure how to accomplish this.

We were trying to do this locally, but we needed to change the read and write permissions (as the Hadoop user, not the local user) on each node.

Set the Hadoop user's PYTHON_EGG_CACHE to a tmp directory.
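One way this can be done (a sketch; the directory path and the spark.executorEnv.* mechanism are assumptions based on standard Spark/YARN configuration, not part of the original answer):

```shell
# On each node, as the Hadoop user: create a writable egg cache.
mkdir -p /tmp/python-eggs
chmod 700 /tmp/python-eggs

# Export the variable so local processes pick it up...
export PYTHON_EGG_CACHE=/tmp/python-eggs

# ...and/or pass it explicitly to the YARN executors when launching:
# pyspark --conf "spark.executorEnv.PYTHON_EGG_CACHE=/tmp/python-eggs" ...
```

Setting it via spark.executorEnv ensures the executors (which run as the Hadoop user inside YARN containers) see the variable regardless of their login shell.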

Then in the unix prompt:

export PYTHONPATH=/usr/anaconda/bin/python

export MONGO_SPARK_SRC=/home/arustagi/mongodb/mongo-hadoop/spark

export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python

Verify PYTHONPATH:

-bash-4.2$ echo $PYTHONPATH
/usr/anaconda/bin/python:/home/arustagi/mongodb/mongo-hadoop/spark/src/main/python

Command to invoke PySpark:

pyspark --jars /home/arustagi/mongodb/mongo-hadoop-spark-1.5.2.jar,/home/arustagi/mongodb/mongodb-driver-3.6.3.jar,/home/arustagi/mongodb/mongo-java-driver-3.6.3.jar --driver-class-path /home/arustagi/mongodb/mongo-hadoop-spark-1.5.2.jar,/home/arustagi/mongodb/mongodb-driver-3.6.3.jar,/home/arustagi/mongodb/mongo-java-driver-3.6.3.jar --master yarn-client --py-files /usr/anaconda/lib/python2.7/site-packages/pymongo_spark-0.1.dev0-py2.7.egg,/home/arustagi/mongodb/pymongo_spark.py

On the pyspark console:

18/04/06 15:21:04 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: import pymongo_spark

In [2]: pymongo_spark.activate()

In [3]: mongo_rdd = sc.mongoRDD('mongodb://xx.xxx.xxx.xx:27017/test.restaurantssss')

In [4]: print(mongo_rdd.first())

{u'cuisine': u'Italian', u'_id': ObjectId('5a9cd076219d0d1f1039de7f'), u'name': u'Vella', u'restaurant_id': u'41704620', u'grades': [{u'date': datetime.datetime(2014, 10, 1, 0, 0), u'grade': u'A', u'score': 11}, {u'date': datetime.datetime(2014, 1, 16, 0, 0), u'grade': u'B', u'score': 17}], u'address': {u'building': u'1480', u'street': u'2 Avenue', u'zipcode': u'10075', u'coord': [-73.9557413, 40.7720266]}, u'borough': u'Manhattan'}