I use Spark (3.0.0) Structured Streaming to read a topic from Kafka.
I've used joins and then mapGroupsWithState to derive my stream data, so I have to use the update output mode, based on my understanding of the official Spark guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
The output-sinks section of the official guide says nothing about a DB sink, and the file sink only supports append mode, so it doesn't work with update mode either: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
Currently I output to the console (a minimal sketch of my pipeline is below), and I would like to store the data in files or a DB instead.
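Roughly, my pipeline looks like this. This is only a sketch: the bootstrap server, topic name, and the "keep the latest value per key" state function are simplified placeholders for my real joins and state logic.

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaStatefulStream {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("kafka-mapGroupsWithState")
        .getOrCreate();

    // Source: a Kafka topic (server and topic name are placeholders).
    Dataset<Row> source = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "my-topic")
        .load()
        .selectExpr("CAST(key AS STRING) AS k", "CAST(value AS STRING) AS v");

    // ... my joins and parsing happen here, elided for brevity ...

    // Stand-in for my real state function: keep the latest value per key.
    // A query with mapGroupsWithState only supports the Update output mode.
    Dataset<String> latest = source
        .groupByKey((MapFunction<Row, String>) row -> row.getString(0), Encoders.STRING())
        .mapGroupsWithState(
            (MapGroupsWithStateFunction<String, Row, String, String>)
                (key, values, state) -> {
                  String last = state.exists() ? state.get() : null;
                  while (values.hasNext()) {
                    last = values.next().getString(1);
                  }
                  state.update(last);
                  return key + " -> " + last;
                },
            Encoders.STRING(),   // state encoder
            Encoders.STRING());  // output encoder

    // This is what I have today: the console sink, which accepts Update mode.
    StreamingQuery query = latest.writeStream()
        .outputMode(OutputMode.Update())
        .format("console")
        .start();
    query.awaitTermination();
  }
}
```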
So my question is: how can I write the stream data to a DB or files in my situation? Do I have to write the data to Kafka and then use Kafka Connect to read it back into files/a DB?
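If the answer turns out to be "go through Kafka", I assume the write side would look roughly like this, continuing from the sketch above (the output topic and checkpoint path are made up, and the actual file/DB persistence would then be handled by a Kafka Connect sink connector):

```java
// The Kafka sink does support Update mode, so I could publish the stateful
// results to an output topic. A Dataset<String> already exposes a single
// "value" column, which is what the Kafka sink expects.
StreamingQuery toKafka = latest.writeStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output-topic")                       // placeholder topic
    .option("checkpointLocation", "/tmp/checkpoints/out")  // placeholder path
    .outputMode(OutputMode.Update())
    .start();
```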
P.S. I followed these articles to build the aggregated streaming query:
- https://stackguides.com/questions/62738727/how-to-deduplicate-and-keep-latest-based-on-timestamp-field-in-spark-structured
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
- I will also try the approach below one more time, using the Java API: https://stackguides.com/questions/50933606/spark-streaming-select-record-with-max-timestamp-for-each-id-in-dataframe-pysp