
I'm using Spark's Streaming API, and I just wanted to get a better understanding of how best to design the code.

I'm currently consuming from Kafka (in pyspark) using pyspark.streaming.kafka.KafkaUtils.createDirectStream.

According to http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Essentially, I want to apply a set of functions to each of the elements in the DStream. Currently, I'm using the "map" function of pyspark.streaming.DStream. According to the documentation, my approach seems correct. http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream

map(f, preservesPartitioning=False) Return a new DStream by applying a function to each element of DStream.

Should I be using map, or would the right approach be to apply functions/transformations to the RDDs directly (since a DStream is a sequence of RDDs)?

foreachRDD(func) Apply a function to each RDD in this DStream.

More Docs: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html
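
For reference, here is a minimal sketch of my current setup (the topic name, broker address, and parse function are just placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaMapExample")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    # Direct stream over a hypothetical "events" topic
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "localhost:9092"})

    def parse(record):
        # record is a (key, value) tuple; value holds the message payload
        key, value = record
        return value.upper()  # placeholder transformation

    parsed = stream.map(parse)
    parsed.pprint()

    ssc.start()
    ssc.awaitTermination()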


1 Answer


DirectStream.map is a correct choice here. The following map:

stream.map(f)

is equivalent to:

stream.transform(lambda rdd: rdd.map(f))
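
So if your goal is to apply a set of functions to every element, you can simply chain map calls on the stream. A rough sketch, where stream is the DStream from createDirectStream and the functions are placeholders for your own logic:

    # Hypothetical per-element functions; substitute your own.
    def parse(value):
        return value.split(",")

    def enrich(fields):
        return fields + ["extra"]

    def to_record(fields):
        return tuple(fields)

    # Each map applies its function to every element of the stream.
    result = stream.map(lambda kv: kv[1]).map(parse).map(enrich).map(to_record)
    result.pprint()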

DirectStream.foreachRDD, on the other hand, is an output action and creates an output DStream. The function you use with foreachRDD is not expected to return anything, and neither does the method itself, which is obvious once you take a look at the Scala signature:

def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
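
Use foreachRDD when you want side effects at the end of the pipeline, for example pushing each batch to an external store. A minimal sketch, where send_partition stands in for whatever your sink requires:

    def send_partition(records):
        # Hypothetical sink: open a connection here and write each record.
        for record in records:
            pass  # e.g. producer.send(record)

    def push_batch(rdd):
        # Called on the driver once per batch; the per-record work runs on executors.
        rdd.foreachPartition(send_partition)

    stream.foreachRDD(push_batch)  # returns None; there is nothing to chain afterwards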