How to submit a python wordcount on HDInsight Spark cluster from Jupyter

Question

I am trying to run a python wordcount on a Spark HDInsight cluster and I'm running it from Jupyter. I'm not actually sure if this is the right way to do it, but I couldn't find anything helpful about how to submit a standalone python app on HDInsight Spark cluster.

The code :

import pyspark
import operator
from pyspark import SparkConf
from pyspark import SparkContext
import atexit
from operator import add
conf = SparkConf().setMaster("yarn-client").setAppName("WC")
sc = SparkContext(conf = conf)
atexit.register(lambda: sc.stop())

input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
words = input.flatMap(lambda x: x.split())
wordCount = words.map(lambda x: (str(x),1)).reduceByKey(add)

wordCount.saveAsTextFile("wasb:///example/outputspark")

And the error message I get and don't understand :

ValueError                                Traceback (most recent call last)
<ipython-input-2-8a9d4f2cb5e8> in <module>()
      6 from operator import add
      7 import atexit
----> 8 sc = SparkContext('yarn-client')
      9 
     10 input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")

/usr/hdp/current/spark-client/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    108         """
    109         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 110         SparkContext._ensure_initialized(self, gateway=gateway)
    111         try:
    112             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/hdp/current/spark-client/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway)
    248                         " created by %s at %s:%s "
    249                         % (currentAppName, currentMaster,
--> 250                             callsite.function, callsite.file, callsite.linenum))
    251                 else:
    252                     SparkContext._active_spark_context = instance

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=yarn-client) created by __init__ at <ipython-input-1-86beedbc8a46>:7

Is it actually possible to run python job this way? If yes - it seems to be the problem with SparkContext definition... I tried different ways:

sc = SparkContext('spark://headnodehost:7077', 'pyspark')

and

conf = SparkConf().setMaster("yarn-client").setAppName("WordCount1")
sc = SparkContext(conf = conf)

but no success. What would be the right way to run the job or configure SparkContext?

maxiluk maxiluk · Accepted Answer · 2017-01-05T21:47:22

If you running from Jupyter notebook than Spark context is pre-created for you and it would be incorrect to create a separate context. To resolve the problem just remove the lines that create the context and directly start from:

input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")

If you need to run standalone program you can run it from command line using pyspark or submit it using REST APIs using Livy server running on the cluster.

How to submit a python wordcount on HDInsight Spark cluster from Jupyter

3 Answers