Background: I'm using Spark Streaming to consume events from Kafka that arrive as comma-separated key=value pairs. Here is an example of how events are streamed into my Spark application:
Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4,responseTime=200
Key1=Value5, Key2=Value6, Key3=Value7, Key4=Value8,responseTime=150
Key1=Value9, Key2=Value10, Key3=Value11, Key4=Value12,responseTime=100
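Since each event is just a comma-separated list of Key=Value fragments, one line can be parsed into a map so the grouping keys and responseTime stay associated with the same event. A minimal sketch of that idea (parseEvent is a hypothetical helper name, not part of my stream code):

// Parse a single event line into a Map; split on "," first,
// then split each fragment on "=" into a (key, value) pair.
def parseEvent(line: String): Map[String, String] =
  line.split(",")
      .map(_.trim)
      .map(_.split("=", 2))
      .collect { case Array(k, v) => (k, v) }
      .toMap

// parseEvent("Key1=Value1, Key2=Value2, responseTime=200")
//   == Map("Key1" -> "Value1", "Key2" -> "Value2", "responseTime" -> "200")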
Output:
I want to calculate different metrics (avg, count, etc.) grouped by different keys in the stream for a given batch interval, e.g.:
- Average responseTime by Key1, Key2 (responseTime is one of the keys in every event)
- Count by Key1, Key2
My attempts so far:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

val stream = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsSet)
val pStream = stream.persist()

// Split each event on "," and each fragment on "=" to get (key, value) pairs.
val events: DStream[String] = pStream.flatMap(_._2.split(",").map(_.trim))
val pairs = events.map(data => data.split("=")).map(array => (array(0), array(1)))
// pairs results in tuples of (Key1, Value1), (Key2, Value2) and so on,
// but the flatMap loses track of which pairs came from the same event.
Update (03/04): The keys Key1, Key2, ... can arrive out of order in the incoming stream.
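For what it's worth, the direction I'm considering is to keep each event line intact, parse it into a map (which makes the key order irrelevant), and then aggregate per batch with reduceByKey. A sketch of that idea, building on pStream from the code above (the inline parsing is an assumption about the event format, with no guarding against missing keys):

// Keep whole event lines instead of flatMap-ing them apart.
val lines: DStream[String] = pStream.map(_._2)

// Parse each line into a Map so out-of-order keys are handled,
// then key by (Key1, Key2) carrying (responseTime sum, count).
val keyed = lines.map { line =>
  val kv = line.split(",").map(_.trim.split("=", 2)).map(a => (a(0), a(1))).toMap
  ((kv("Key1"), kv("Key2")), (kv("responseTime").toLong, 1L))
}

// One reduce per batch gives both metrics: the count, and the sum for the average.
val stats = keyed.reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }

stats.foreachRDD { rdd =>
  rdd.map { case ((k1, k2), (sum, n)) => (k1, k2, sum.toDouble / n, n) }
     .collect()
     .foreach(println) // (Key1, Key2, avgResponseTime, count)
}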
I'd appreciate your inputs / hints.
DataFrames? – zero323
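If the suggestion is to convert each batch to a DataFrame, a sketch of what that might look like (assuming Spark 1.x with SQLContext; the Event case class is my own, defined at top level so the implicit toDF conversion can serialize it):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, count}

// Hypothetical case class holding the fields I group and aggregate on.
case class Event(key1: String, key2: String, responseTime: Long)

pStream.map(_._2).foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Parse each event line into an Event, then convert the batch to a DataFrame.
  val df = rdd.map { line =>
    val kv = line.split(",").map(_.trim.split("=", 2)).map(a => (a(0), a(1))).toMap
    Event(kv("Key1"), kv("Key2"), kv("responseTime").toLong)
  }.toDF()

  // Both metrics in one pass: average responseTime and count per (Key1, Key2).
  df.groupBy("key1", "key2")
    .agg(avg("responseTime"), count("*"))
    .show()
}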