3 votes

I am trying to aggregate and compute some metrics with Spark Streaming (reading from Kafka) every minute. I am able to aggregate the data for each particular minute. How do I keep a bucket for the current day and sum up the aggregate values of all the minutes in that day?

I have a data frame and I am doing something similar to this.

sampleDF = spark.sql("select userId,sum(likes) as total from likes_dataset group by userId order by userId")

2 Answers

1 vote

You can make use of the watermarking feature of Structured Streaming.

Sample code

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()
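
Applied to the data in the question, a daily bucket is just a one-day tumbling window per userId. A rough PySpark sketch (the likesDF placeholder, the extra column names, and the one-hour watermark delay are assumptions, not from the question):

from pyspark.sql import functions as F

# likesDF: streaming DataFrame of schema { timestamp: Timestamp, userId: String, likes: Int },
# built from the Kafka source (omitted here, as in the Scala example above)
likesDF = ...

dailyTotals = (likesDF
    # allow events to arrive up to 1 hour late before a day's bucket is finalized
    .withWatermark("timestamp", "1 hour")
    # one tumbling 1-day window per userId = one bucket per calendar day
    .groupBy(F.window(F.col("timestamp"), "1 day"), F.col("userId"))
    .agg(F.sum("likes").alias("total")))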
1 vote

I figured out what was going on. I learned about stateful streaming in Spark, and that solved it for me.

All I had to do was:

running_counts = countStream.updateStateByKey(updateTotalCount, initialRDD=initialStateRDD) 

where I had to write the updateTotalCount function to specify how to merge the old aggregate data with the new aggregate data from each micro batch. In my case, the update function looks like this:

def updateTotalCount(currentCount, countState):
    if countState is None:
        countState = 0
    return sum(currentCount) + countState
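
For completeness, here is roughly how that update function fits into a DStream job. This is a minimal sketch rather than the actual pipeline: the socket source, checkpoint path, app name, and seed state are stand-ins, and the real job would build countStream from Kafka instead.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def updateTotalCount(currentCount, countState):
    # currentCount: list of new values for a key in the current micro batch
    # countState: running total carried over from previous batches (None on first update)
    if countState is None:
        countState = 0
    return sum(currentCount) + countState

sc = SparkContext(appName="RunningLikeTotals")
ssc = StreamingContext(sc, 60)            # 1-minute micro batches, as in the question
ssc.checkpoint("/tmp/likes-checkpoint")   # updateStateByKey requires a checkpoint directory

# Stand-in source: lines of "userId,likes" from a socket, in place of the Kafka stream
lines = ssc.socketTextStream("localhost", 9999)
countStream = lines.map(lambda line: line.split(",")).map(lambda p: (p[0], int(p[1])))

# Optional seed for the state, e.g. totals carried over from an earlier run
initialStateRDD = sc.parallelize([("someUser", 0)])

running_counts = countStream.updateStateByKey(updateTotalCount, initialRDD=initialStateRDD)
running_counts.pprint()   # per-user running totals, refreshed every minute

ssc.start()
ssc.awaitTermination()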