Spark Python in Zeppelin on AWS: error running program

Question

I tried the Python example code in Zeppelin (Spark on AWS EMR) and got an error when running it. The output I expected is a word count of a file in my S3 storage.

text_file = sc.textFile("s3://mybuckettest2/Scenarios.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("s3://mybuckettest2/test.txt")

The error:

 Traceback (most recent call last):
  File "/tmp/zeppelin_python-2374039163027007666.py", line 319, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_python-2374039163027007666.py", line 307, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
NameError: name 'sc' is not defined

I tried the same code in Hue on AWS EMR, and there it ran successfully. - Wahyudi

Lamanus · Accepted Answer · 2019-08-01T11:14:59

I found this in the documentation:

SparkContext, SQLContext and ZeppelinContext are automatically created and exposed as variable names sc, sqlContext and z, respectively, in Scala, Python and R environments. Starting from 0.6.1 SparkSession is available as variable spark when you are using Spark 2.x.

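Note that the passage lists Python as well as Scala, so sc should also be available in a PySpark paragraph, and sqlContext (or spark on Spark 2.x) is exposed the same way. The zeppelin_python file name in your traceback suggests the paragraph may be running under the plain %python interpreter, which as far as I can tell does not create these variables, so try running it as %pyspark instead.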
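To check what the interpreter actually exposes, a small diagnostic along these lines can be run in the failing paragraph. It is only a sketch: the helper name zeppelin_names_defined is made up for illustration and is not part of Zeppelin or Spark, and the list of names is taken from the documentation passage quoted above.

```python
# Variable names that the quoted Zeppelin documentation says are pre-created.
ZEPPELIN_NAMES = ("sc", "sqlContext", "z", "spark")


def zeppelin_names_defined(namespace):
    """Return a dict mapping each expected name to True or False.

    `namespace` is any dict of variables, for example globals() inside
    a notebook paragraph. Hypothetical helper, not a Zeppelin API.
    """
    return {name: name in namespace for name in ZEPPELIN_NAMES}


# Inspect the namespace the current paragraph runs in.
status = zeppelin_names_defined(globals())
for name, defined in status.items():
    print(name, "defined" if defined else "NOT defined")
```

If sc is reported as not defined, the paragraph is probably not bound to the Spark-aware Python interpreter. If only spark is defined, spark.sparkContext returns the underlying SparkContext, which can be used in place of sc for the word count.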
