2
votes

I'm running Spark in standalone mode on Windows 8, using Anaconda 3.5 and an IPython notebook.

Here is how I'm trying to set up the environment:

import os
import sys
import numpy
spark_path = r"D:\spark"  # raw string, so the backslash is not treated as an escape
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path


sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")


from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext("local", "test")

When I try to run the following code:

rdd = sc.parallelize([1,2,3])
rdd.count() 

it gives me this error:

Python in worker has different version 3.4 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I tried this

import os

os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/2.1.0/" ## Exact anaconda path in "program files"

And I tried this

But neither solved my problem. Can someone please help me resolve the issue? I'm a bit non-technical when it comes to computer system configuration.

Thanks a lot!

1
For starters, Spark 1.5 is quite old by Spark standards; AFAIK the 1.x branch is pretty much stalled on 1.6.3 -- while the 2.x branch is currently at 2.1.1 (and moving fast...) - Samson Scharfrichter
Disclaimer: I'm not familiar with the legacy IPython -- only with Jupyter and the way it uses "kernel" configurations for Python / Spark / whatever, in JSON files (just like the one shown in your link). Therefore I don't fully understand what you are trying to do with that Python code that attempts to configure Spark the hard way. But the error message suggests that you should set os.environ['PYSPARK_PYTHON'] = '/wherever/is/your/anaconda/python3.5' (yes, the full path to the same executable you are using to run that script... which is clearly not the default python in your PATH) - Samson Scharfrichter
@SamsonScharfrichter: Thanks for your comment. I tried 2.1.0 as well. But still getting this error. I'm specifying these environment variables in the jupyter notebook, rather than in the system. Also, I tried "PYSPARK_PYTHON" option. But still it didn't work. - Beta
A kernel is not the "system", it's just a configuration file to start a specific run-time environment... Setting the "system" for Spark would be in $SPARK_HOME/conf/spark-env.sh with an export PYSPARK_PYTHON=/wherever/is/your/anaconda/python3.5 - Samson Scharfrichter
@SamsonScharfrichter: Thanks, Samson, for your answer! I'll test it and let you know. You can post your comment as an answer and I'll mark it as accepted. - Beta

1 Answer

1
votes

First of all, if you are working with Spark, I would suggest using VirtualBox and installing Ubuntu 14.04 LTS or CentOS. Even in standalone mode, developing applications on a Windows backend is going to be much harder. On top of that, connecting to a Hive metastore / Hadoop from Windows is nearly impossible...

We had the same problem with Cloudera Manager; the solution was to install the same version of Anaconda on all nodes and change the PATH variable in .bashrc.

I think it is better to set these variables outside of Jupyter. Try reconfiguring your PATH environment variables in Windows for Python and Spark.
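As a minimal sketch of the fix suggested in the comments (assuming your notebook itself runs under the Anaconda 3.5 interpreter): point both the driver and the workers at the same Python via `sys.executable`, and do it before the `SparkContext` is created.

```python
import os
import sys

# sys.executable is the full path of the Python interpreter running this
# script/notebook. Using it for both variables guarantees that the driver
# and the workers use the exact same Python version.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# These must be set BEFORE the SparkContext is created, e.g.:
# from pyspark import SparkContext
# sc = SparkContext("local", "test")
```

Alternatively, set `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` as Windows system environment variables so every session picks them up, rather than setting them inside the notebook.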