I am having trouble making a simple 'hello world' connection between PySpark and MongoDB (see the example I am trying to emulate: https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python). Can someone please help me understand and fix this issue?
Details:
I can successfully launch the pyspark shell with the --jars, --conf, and --py-files options seen below, import pymongo_spark, and connect to the DB. However, when I try to print 'hello world', Python fails while extracting the egg files because of a permission denied error on '/home/ .cache'. I don't think our environment settings are correct, and I am not sure how to fix this (see the attached error screenshot).
My analysis: it is not clear whether this is a Spark/HDFS, pymongo_spark, or PySpark issue. Spark or pymongo_spark seems to default to each node's /home/ .cache directory for its egg cache.
Here is my pyspark environment:
    pyspark \
      --jars mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar,mongo-java-driver-3.6.3.jar \
      --driver-class-path mongo-java-driver-3.6.3.jar:mongo-hadoop-spark-1.5.2.jar:mongodb-driver-3.6.3.jar \
      --master yarn-client \
      --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
      --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
      --py-files pymongo_spark.py
    In [1]: import pymongo_spark
    In [2]: pymongo_spark.activate()
    In [3]: mongo_rdd = sc.mongoRDD('mongodb://xx.xxx.xxx.xx:27017/test.restaurantssss')
    In [4]: print(mongo_rdd.first())
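To narrow down whether this is an environment problem on the worker nodes, one quick check is to see what HOME and the egg-cache variable look like inside the executors. This is only a diagnostic sketch run from the same pyspark shell; sc is the SparkContext the shell already provides, and PYTHON_EGG_CACHE is the setuptools variable that controls where eggs are extracted.

    # Diagnostic sketch: inspect the environment the Python workers actually run with.
    # Run from the same pyspark shell; 'sc' is the SparkContext created by the shell.
    import os

    def worker_env(_):
        # Report the values setuptools would rely on when extracting eggs on this executor.
        return {
            'home': os.path.expanduser('~'),
            'PYTHON_EGG_CACHE': os.environ.get('PYTHON_EGG_CACHE'),
            'user': os.environ.get('USER'),
        }

    # A few small tasks are enough to sample the executors.
    print(sc.parallelize(range(4), 4).map(worker_env).collect())

If the reported home directory or egg cache points at a location the YARN containers cannot write to, that would match the permission denied error above.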
Comment: set the PYTHON_EGG_CACHE environment variable to point to an accessible location. Can you try this on your cluster nodes? – ernest_k
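One way to try that suggestion would be to point PYTHON_EGG_CACHE at a directory the containers can write to, both for the driver and for the executors. This is a sketch rather than a verified fix: /tmp/.python-eggs is just an assumed writable path, and spark.executorEnv.<VAR> is Spark's standard mechanism for passing environment variables to executors.

    # Sketch of the commenter's suggestion, assuming /tmp/.python-eggs is writable on every node.
    # The export covers the driver (yarn-client mode runs it locally);
    # spark.executorEnv.PYTHON_EGG_CACHE covers the YARN executors.
    export PYTHON_EGG_CACHE=/tmp/.python-eggs

    pyspark \
      --jars mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar,mongo-java-driver-3.6.3.jar \
      --driver-class-path mongo-java-driver-3.6.3.jar:mongo-hadoop-spark-1.5.2.jar:mongodb-driver-3.6.3.jar \
      --master yarn-client \
      --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
      --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
      --conf "spark.executorEnv.PYTHON_EGG_CACHE=/tmp/.python-eggs" \
      --py-files pymongo_spark.py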