
I need to perform an aggregation on incoming data based on the Spark driver's timestamp, without a watermark. My data doesn't have any timestamp field.

The requirement is to compute an average of the data received every second (it doesn't matter when it was sent).

For example, I need an aggregation over the data received in each trigger, just like the old RDD-based streaming API.

Is there a way to do that?


2 Answers

1 vote

You can create your own Sink and perform your operation in each addBatch() call:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class CustomSink extends Sink {
  // Called once per micro-batch; `data` contains only the rows received
  // since the previous trigger, so the aggregation is per-trigger.
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    data.groupBy().agg(sum("age") as "sumAge").foreach(v => println(s"RESULT=$v"))
  }
}

class CustomSinkProvider extends StreamSinkProvider with DataSourceRegister {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = {
    new CustomSink()
  }

  override def shortName(): String = "custom"
}

With outputMode set to Update and a processing-time trigger every X seconds:

  val query = ds.writeStream
    .trigger(Trigger.ProcessingTime("1 second"))
    .outputMode(OutputMode.Update())
    .format("exactlyonce.CustomSinkProvider")
    .start()
0 votes

Does "Trigger by processing time" fit your requirements? "Trigger by processing time" triggers every interval(defined by code).

Example trigger code is linked below.

https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/streaming/ProcessingTime.scala#L34
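
For illustration, here is a minimal sketch of a one-second processing-time trigger; the rate source and console sink are only placeholders for your real input and output:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("processing-time-trigger").getOrCreate()

// The built-in `rate` source is just a stand-in for the real stream.
val df = spark.readStream.format("rate").load()

// Fire a micro-batch every second; each batch covers the rows
// received during that interval.
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

query.awaitTermination()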