I am new to installing Spark on a Linux machine, and have what is possibly a basic question: I have installed Spark version 1.6.0 and Python 2.6.6.
In Spark's interactive mode, I am able to run a few simple commands to count the lines in the README.md file.
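Roughly, what I type in the pyspark shell looks like this (a sketch from memory; the exact lines may have differed slightly):

>>> rdd = sc.textFile("/opt/spark/README.md")  # sc is created by the shell
>>> rdd.count()
95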
However, when I try to achieve the same result from a standalone Python script, I get errors.
My Python code, in test.py:
#!/usr/bin/python
import pyspark
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Configure a local Spark context with 3 worker threads
    conf = SparkConf().setAppName("word count").setMaster("local[3]")
    sc = SparkContext(conf=conf)
    # Count the lines in the bundled README and print the total
    rdd = sc.textFile("/opt/spark/README.md")
    print(rdd.count())
If I run this as:
spark-submit ./test.py
I get the correct result:
95
My question is: why can't I just run it as
./test.py
since I am importing pyspark and SparkContext in my Python script?
Instead, I get this error:
Traceback (most recent call last):
File "./test.py", line 8, in <module>
sc = SparkContext(conf = conf)
File "/usr/local/lib/python2.7/site-packages/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/usr/local/lib/python2.7/site-packages/pyspark/context.py", line 188, in _do_init
self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
TypeError: 'JavaPackage' object is not callable
From my Google searches, I gather that I'm missing some jars somewhere, but I don't think I understand what exactly is going on here. I would appreciate it if someone could point me to a basic tutorial on how to set up the Spark environment variables and CLASSPATH.
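For what it's worth, the closest thing to a fix I have found so far is the findspark package, which several search results suggest for exactly this kind of setup problem. Below is a sketch of how I understand it would change my script; it assumes Spark is installed under /opt/spark (true on my machine) and that findspark has been installed with pip, and I haven't yet confirmed it actually resolves my error:

#!/usr/bin/python
# Sketch: use findspark to point a plain Python process at a local Spark install.
# Assumes Spark lives in /opt/spark; adjust the path for your machine.
import findspark
findspark.init("/opt/spark")  # sets SPARK_HOME and adds pyspark/py4j to sys.path

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName("word count").setMaster("local[3]")
    sc = SparkContext(conf=conf)
    print(sc.textFile("/opt/spark/README.md").count())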
I have already read this question, but it doesn't go into as much detail:
What is the difference between spark-submit and pyspark?
Thank you.