1 vote

I am new to installing Spark on a Linux machine and have what is possibly a basic question: I have installed Spark version 1.6.0 and Python 2.6.6.

In Spark interactive mode (the pyspark shell), I am able to run these simple commands to count lines in the README.md file.
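
Roughly, in the pyspark shell it is something like -

>>> lines = sc.textFile("/opt/spark/README.md")
>>> lines.count()
95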

However, I want to be able to create a standalone Python script and achieve the same result, but am getting errors.

My Python code in test.py -

#!/usr/bin/python
import pyspark
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName("word count").setMaster("local[3]")
    sc = SparkContext(conf = conf)


    rdd = sc.textFile("/opt/spark/README.md")

    print(rdd.count())

If I run this as -

spark-submit ./test.py 

I get a correct result.

95

My question is, why can't I run this as just -

./test.py 

since I am importing pyspark and SparkContext in my Python script.

I get the error -

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    sc = SparkContext(conf = conf)
  File "/usr/local/lib/python2.7/site-packages/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/lib/python2.7/site-packages/pyspark/context.py", line 188, in _do_init
    self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
TypeError: 'JavaPackage' object is not callable

I know I'm missing some jars somewhere, as per my Google searches, but I don't think I understand what exactly is going on here. I would appreciate it if someone could point me to a basic tutorial on how to set up the Spark environment variables and CLASSPATH.

I already read this question, but it doesn't go into as much detail -

What is the difference between spark-submit and pyspark?

Thank you.


2 Answers

0 votes

Let's focus on two pieces of information:

  • I have installed Spark version 1.6.0 and Python 2.6.6

  • self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
    

These two details suggest that your Spark installation is misconfigured:

  • You believe you are using Spark 1.6 (this seems to be the version of the jars on your path).
  • The Python package on the path uses code introduced in Spark 2.1.0 (SPARK-16861).

Most likely this is the result of an incorrectly set PYTHONPATH or an equivalent environment variable: the pyspark package that Python picks up is newer than the Spark jars it talks to, so the PythonAccumulatorV2 class it asks for does not exist on the JVM side, hence "'JavaPackage' object is not callable".
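
A quick way to check which pyspark package plain python actually picks up, and whether any Spark environment variables are visible when you run ./test.py directly, is a tiny diagnostic script (a sketch only; the paths it prints will be specific to your machine) -

#!/usr/bin/python
# Diagnostic: show which pyspark is imported and what the environment looks like
# outside of spark-submit.
import os
import pyspark

print(pyspark.__file__)              # path of the pyspark package actually imported
print(os.environ.get("SPARK_HOME"))  # None unless you exported it yourself
print(os.environ.get("PYTHONPATH"))  # None unless you exported it yourself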

0 votes

user8371915 - you pointed me in the right direction; it was an issue with PYTHONPATH not being set up at all.

I found that this link covers all the info I needed, and I was able to get my code to run with just -

./test.py

95

http://efimeres.com/2016/03/setup-spark-standalone/
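
In case the link goes away: the gist of the fix is making a plain python process see Spark's bundled Python API and py4j, which spark-submit normally does for you. Below is a minimal sketch of the equivalent done inside the script, assuming the /opt/spark install location from the question (in practice you would export SPARK_HOME and PYTHONPATH in your shell profile instead, which is the kind of setup the link describes) -

#!/usr/bin/python
# Sketch: put Spark's own Python package and its bundled py4j zip on sys.path
# before importing pyspark, so the script can run without spark-submit.
import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/opt/spark")  # assumed install path
os.environ["SPARK_HOME"] = spark_home

sys.path.insert(0, os.path.join(spark_home, "python"))
for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    sys.path.insert(0, py4j_zip)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("word count").setMaster("local[3]")
sc = SparkContext(conf=conf)
print(sc.textFile(os.path.join(spark_home, "README.md")).count())
sc.stop()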