Exec pyspark standalone script without spark-submit

Question

I am new to installing spark on a Linux machine and have possibly a basic question : I have installed Spark version 1.6.0 and Python 2.6.6

In spark interactive mode, I am able to run these simple commands to count lines in the README.md file.

However, I want to be able to create a standalone Python script and achieve the same result, but am getting errors.

My python code in test.py -

#!/usr/bin/python
import pyspark
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName("word count").setMaster("local[3]")
    sc = SparkContext(conf = conf)


    rdd = sc.textFile("/opt/spark/README.md")

    print(rdd.count())

If I run this as -

spark-submit ./test.py

I get a correct result.

My question is, why can't I run this as just -

./test.py

since I am importing pyspark and SparkContext in my python script.

I get the error -

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    sc = SparkContext(conf = conf)
  File "/usr/local/lib/python2.7/site-packages/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/lib/python2.7/site-packages/pyspark/context.py", line 188, in _do_init
    self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
TypeError: 'JavaPackage' object is not callable

I know I'm missing adding some jars somewhere, as per my google searches, but I don't think I understand whats exactly going on here. I would appreciate if someone could point me to a basic tutorial on how to setup spark variables and CLASSPATH.

I already read this question, but its not as much detail -

What is the difference between spark-submit and pyspark?

Thank you.

Alper t. Turker Alper t. Turker · Accepted Answer · 2018-06-03T10:37:01

Let's focus on two pieces of information:

I have installed Spark version 1.6.0 and Python 2.6.6

self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)

These parts suggest that you misconfigured your Spark installation:

You believe to use Spark 1.6 (this seems to be the version of jars in your path).
Python package in the path uses code introduced in Spark 2.1.0 (SPARK-16861).

Most likely this is the result of incorrectly set PYTHONPATH or equivalent environment variable.

Exec pyspark standalone script without spark-submit

2 Answers