Spark Kafka WordCount Python

Question

I've just started playing with Apache Spark and am trying to get the Kafka word count example to work in Python. I chose Python because it's a language I'll be able to use with other big data technologies, and Databricks is also offering its courses through Spark.

My question: I'm running the basic word count example from here: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py It seems to start up and connect to Kafka, but I can't see it actually produce a word count. I then added the lines below to write the results to a text file, and it just produces a bunch of empty text files. It is connecting to the Kafka topic and there is data in the topic, so how can I see what it's actually doing with the data, if anything? Could it be a timing issue? Cheers.

Code for processing Kafka data

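                # Split each record on "|", pair every token with 1, then sum per token
                counts = lines.flatMap(lambda line: line.split("|")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)
                # saveAsTextFiles returns None, so call it separately to keep `counts` usable
                counts.saveAsTextFiles("sparkfiles")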

Data in Kafka topic

                    16|16|Mr|Joe|T|Bloggs

Colman Colman · Accepted Answer · 2015-05-14T01:18:06

Sorry, I was being an idiot. Once I produced data to the topic while the Spark app was running, I could see the following in the output:

                (u'a', 29)
                (u'count', 29)
                (u'This', 29)
                (u'is', 29)
                (u'so', 29)
                (u'words', 29)
                (u'spark', 29)
                (u'the', 29)
                (u'can', 29)
                (u'sentence', 29)

Each pair shows how many times that word appeared in the batch that Spark had just processed.
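To make the per-batch behaviour concrete, here is a minimal plain-Python sketch (no Spark or Kafka involved) of what the flatMap / map / reduceByKey chain computes for each batch. The helper name `count_batch` and the two sample batches are made up for illustration. It assumes, as the behaviour above suggests, that a batch interval in which no new messages arrive is simply an empty batch, which would explain the empty output files.

```python
from collections import Counter

def count_batch(lines, sep="|"):
    """Plain-Python stand-in for flatMap -> map -> reduceByKey on one batch."""
    counts = Counter()
    for line in lines:
        # flatMap: split the record into tokens; Counter does the (token, 1) sum
        counts.update(line.split(sep))
    return dict(counts)

# Hypothetical batches: nothing arrives in the first interval,
# one record (the sample from the question) arrives in the second.
batches = [[], ["16|16|Mr|Joe|T|Bloggs"]]
for i, batch in enumerate(batches):
    print(i, count_batch(batch))
```

The first batch yields an empty result and the second yields counts only for the record that arrived in that interval; nothing is carried over between batches.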
