We are consuming from Kafka using Structured Streaming and writing the processed dataset to S3.
We also want to write the processed data to Kafka going forward. Is it possible to do that from the same streaming query? (Spark version 2.1.1)
In the logs, I see the streaming query progress output, and I have a sample `durationMs` JSON from the log. Can someone please clarify the difference between `addBatch` and `getBatch`?
And `triggerExecution`: is it the time taken both to process the fetched data and to write it to the sink?
"durationMs" : { "addBatch" : 2263426, "getBatch" : 12, "getOffset" : 273, "queryPlanning" : 13, "triggerExecution" : 2264288, "walCommit" : 552 },
1 Answer
Yes.
In Spark 2.1.1, you can use `writeStream.foreach` to write your data into Kafka. There is an example in this blog: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
Or you can use Spark 2.2.0, which adds an official Kafka sink for writing to Kafka.
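For the Spark 2.2.0 route, a minimal PySpark sketch of the Kafka sink could look like the following. It is only an illustration under assumptions: the function name `write_to_kafka`, the broker address, the topic name, and the checkpoint path are all hypothetical, and the processed DataFrame is assumed to already have `key` and `value` columns. The Kafka sink also needs the `spark-sql-kafka-0-10` package on the classpath.

```python
def write_to_kafka(processed_df, bootstrap_servers, topic, checkpoint_dir):
    """Start a streaming query that writes processed_df to a Kafka topic.

    Assumes Spark 2.2.0 or later (built-in Kafka sink) and that
    processed_df is a streaming DataFrame with `key` and `value` columns.
    """
    return (
        processed_df
        # The Kafka sink expects string or binary `key`/`value` columns.
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", topic)
        # Each streaming query needs its own checkpoint location.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

A call might look like `write_to_kafka(processed_df, "localhost:9092", "processed-events", "/tmp/kafka-sink-checkpoint")`, where every argument value is a placeholder. This starts a query of its own with its own checkpoint directory, so an existing S3 query would keep running alongside it.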
`getBatch` measures how long it takes to create a DataFrame from the source. This is usually pretty fast.
`addBatch` measures how long it takes to run the DataFrame in a sink.
`triggerExecution` measures how long one trigger execution takes, and is usually almost the same as `getOffset` + `getBatch` + `addBatch`.
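To check that relationship against the numbers in the question, here is a small Python sketch that sums the three phases from the posted `durationMs` object and compares the total with `triggerExecution`. The variable names are illustrative only; the values are copied from the question's log.

```python
import json

# durationMs object copied from the question's progress log (milliseconds).
progress = json.loads("""
{ "durationMs" : { "addBatch" : 2263426, "getBatch" : 12, "getOffset" : 273,
                   "queryPlanning" : 13, "triggerExecution" : 2264288,
                   "walCommit" : 552 } }
""")
d = progress["durationMs"]

# Sum of the three phases named in the answer.
main_phases = d["getOffset"] + d["getBatch"] + d["addBatch"]
# Time in the trigger not covered by those three phases.
remainder = d["triggerExecution"] - main_phases
# Share of the trigger spent in addBatch, as a percentage.
add_batch_share = round(100 * d["addBatch"] / d["triggerExecution"], 2)

print(main_phases)      # 2263711
print(remainder)        # 577
print(add_batch_share)  # 99.96
```

In this sample the three phases account for all but 577 ms of the trigger, and `walCommit` (552 ms) plus `queryPlanning` (13 ms) would cover most of that remainder. Almost all of the time is in `addBatch`, which suggests the time is going into running the DataFrame in the sink rather than into reading from the source.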