We are consuming from Kafka using Structured Streaming and writing the processed dataset to S3.
We also want to write the processed data to Kafka going forward. Is it possible to do that from the same streaming query? (Spark version 2.1.1)
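For reference, a minimal sketch of the kind of query in question, assuming a Kafka source and a Parquet sink on S3; the broker address, topic, and paths are placeholders, not our actual values:

// Sketch: read from Kafka, write processed data to S3 as Parquet.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "events")                    // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

val query = input.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/processed/")             // placeholder path
  .option("checkpointLocation", "s3a://bucket/chkpoint/") // placeholder path
  .start()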
In the logs I see the streaming query progress output, and I have a sample durations JSON from the log. Can someone please provide more clarity on the difference between addBatch and getBatch? And is triggerExecution the time taken to both process the fetched data and write to the sink?
"durationMs" : { "addBatch" : 2263426, "getBatch" : 12, "getOffset" : 273, "queryPlanning" : 13, "triggerExecution" : 2264288, "walCommit" : 552 },
1 Answer
Yes. In Spark 2.1.1, you can use writeStream.foreach to write your data into Kafka. There is an example in this blog post: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
Or you can use Spark 2.2.0, which adds an official Kafka sink for writing to Kafka.
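A sketch of the ForeachWriter approach described in that blog post, assuming a string-valued Dataset and the kafka-clients producer on the classpath; the broker address, topic, and checkpoint path are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Writes each row of a string Dataset to a Kafka topic.
class KafkaSink(servers: String, topic: String) extends ForeachWriter[String] {
  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", servers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  def process(value: String): Unit = {
    producer.send(new ProducerRecord(topic, value))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}

// Attach it as a second streaming query on the same processed Dataset:
// processed.as[String].writeStream
//   .foreach(new KafkaSink("broker:9092", "output-topic")) // placeholders
//   .option("checkpointLocation", "s3a://bucket/kafka-chkpoint/") // placeholder
//   .start()

With Spark 2.2.0's built-in Kafka sink, the equivalent is just writeStream.format("kafka").option("kafka.bootstrap.servers", ...).option("topic", ...), provided the Dataset has a string or binary "value" column.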
getBatch measures how long it takes to create a DataFrame from the source. This is usually pretty fast.
addBatch measures how long it takes to run the DataFrame in the sink.
triggerExecution measures how long it takes to run the whole trigger execution, and is usually almost the same as getOffset + getBatch + addBatch.
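If it helps, these durations can also be captured programmatically with a StreamingQueryListener (available in Spark 2.1); the metric names match the keys in the durationMs JSON above. A minimal sketch:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  def onQueryStarted(event: QueryStartedEvent): Unit = {}
  def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  def onQueryProgress(event: QueryProgressEvent): Unit = {
    // durationMs is a java.util.Map[String, java.lang.Long]
    val d = event.progress.durationMs
    println(s"getBatch=${d.get("getBatch")}ms " +
      s"addBatch=${d.get("addBatch")}ms " +
      s"triggerExecution=${d.get("triggerExecution")}ms")
  }
})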