Spark Structured streaming: multiple sinks

Question

We are consuming from Kafka using Structured Streaming and writing the processed dataset to S3.

We also want to write the processed data to Kafka going forward. Is it possible to do this from the same streaming query? (Spark version 2.1.1)

In the logs, I see the streaming query progress output, and I have a sample durationMs JSON from the log. Can someone please clarify the difference between addBatch and getBatch?

triggerExecution: is it the time taken both to process the fetched data and to write it to the sink?

"durationMs" : {
    "addBatch" : 2263426,
    "getBatch" : 12,
    "getOffset" : 273,
    "queryPlanning" : 13,
    "triggerExecution" : 2264288,
    "walCommit" : 552
},

zsxwing · Accepted Answer · 2017-08-14T21:22:23

Yes.

In Spark 2.1.1, you can use writeStream.foreach to write your data into Kafka. There is an example in this blog: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html

Alternatively, you can use Spark 2.2.0, which adds an official Kafka sink for writing to Kafka.
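
As a rough illustration of the Spark 2.2.0 route, here is a minimal PySpark sketch. It assumes `processed_df` is your processed streaming DataFrame with `key` and `value` columns; the helper names, the Parquet format for S3, and all paths and topic names are hypothetical placeholders, not something from the question.

```python
def kafka_sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Build options for Spark's built-in Kafka sink (Spark 2.2.0+).

    Hypothetical helper: the option keys are the ones the Kafka sink reads,
    the values are whatever your environment uses.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        "checkpointLocation": checkpoint_dir,
    }


def start_s3_and_kafka(processed_df, s3_path, s3_checkpoint_dir, kafka_options):
    """Start one streaming query per sink from the same DataFrame.

    Assumes processed_df has `key` and `value` columns; the Kafka sink
    expects `value` (and optionally `key`) as string or binary.
    """
    # Sink 1: files on S3 (Parquet is an assumption; use your existing format).
    s3_query = (
        processed_df.writeStream
        .format("parquet")
        .option("path", s3_path)
        .option("checkpointLocation", s3_checkpoint_dir)
        .start()
    )

    # Sink 2: Kafka, through the sink added in Spark 2.2.0.
    kafka_query = (
        processed_df
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
        .writeStream
        .format("kafka")
        .options(**kafka_options)
        .start()
    )
    return s3_query, kafka_query
```

Note that each `start()` call launches its own streaming query, so in this sketch the source is read once per sink and each query needs a separate checkpoint location.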

getBatch measures how long it takes to create a DataFrame from the source. This is usually pretty fast. addBatch measures how long it takes to run the DataFrame in a sink.
triggerExecution measures how long a whole trigger execution takes, and is usually almost the same as getOffset + getBatch + addBatch.
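
To check that relationship against the numbers in the question, a quick calculation (plain Python, using only the durationMs values posted above) could look like this:

```python
# durationMs values copied from the question's log excerpt (milliseconds).
duration_ms = {
    "addBatch": 2263426,
    "getBatch": 12,
    "getOffset": 273,
    "queryPlanning": 13,
    "triggerExecution": 2264288,
    "walCommit": 552,
}

# Sum of the three phases named in the answer.
main_phases = sum(duration_ms[k] for k in ("getOffset", "getBatch", "addBatch"))

# Time in the trigger not covered by those three phases.
unaccounted = duration_ms["triggerExecution"] - main_phases

# Fraction of the trigger spent running the batch in the sink.
add_batch_share = duration_ms["addBatch"] / duration_ms["triggerExecution"]

print(main_phases)   # 2263711
print(unaccounted)   # 577
print(round(add_batch_share, 4))
```

For this trigger, the three phases account for all but 577 ms of the roughly 37.7 minutes, and almost all of that time is addBatch, that is, running the DataFrame in the sink.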
