I'm using Spark's Streaming API, and I'd like to get a better understanding of how to best design the code.
I'm currently consuming from Kafka (in PySpark) via KafkaUtils.createDirectStream from pyspark.streaming.kafka.
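Roughly, my setup looks like the sketch below (the topic name, broker address, and batch interval are just placeholders, not my actual configuration):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaStreamingExample")
ssc = StreamingContext(sc, 10)  # 10-second batch interval (placeholder)

# Direct stream of (key, value) message tuples from Kafka
stream = KafkaUtils.createDirectStream(
    ssc,
    ["my-topic"],                                # placeholder topic
    {"metadata.broker.list": "localhost:9092"}   # placeholder broker list
)
```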
According to http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Essentially, I want to apply a set of functions to each element in the DStream. Currently, I'm using the map function on pyspark.streaming.DStream. According to the documentation, my approach seems correct: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream
map(f, preservesPartitioning=False): Return a new DStream by applying a function to each element of DStream.
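Concretely, what I'm doing is something like this (parse and enrich are simplified stand-ins for my real functions, and stream is the DStream created above):

```python
def parse(record):
    # record is a (key, value) tuple from the direct stream
    _key, value = record
    return value.strip()

def enrich(value):
    return {"payload": value, "length": len(value)}

# Each map returns a new DStream; the functions are applied element by element
parsed = stream.map(parse)
enriched = parsed.map(enrich)
enriched.pprint()
```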
Should I be using map, or would the right approach be to apply the functions/transformations to the underlying RDDs (since a DStream is represented as a sequence of RDDs)? For example, via foreachRDD:
foreachRDD(func): Apply a function to each RDD in this DStream.
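If I went the RDD route instead, I imagine it would look something like this sketch (again assuming stream from above; process_rdd is just a hypothetical name):

```python
def process_rdd(rdd):
    # rdd is an ordinary PySpark RDD holding one batch of messages,
    # so normal RDD transformations/actions apply here
    if not rdd.isEmpty():
        values = rdd.map(lambda kv: kv[1])
        print("records in this batch:", values.count())

stream.foreachRDD(process_rdd)

ssc.start()
ssc.awaitTermination()
```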
More Docs: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html