0
votes

I use Spark (3.0.0) Structured Streaming to read a topic from Kafka.

I've used joins and then mapGroupsWithState to process my stream data, so I have to use update output mode, based on my understanding of the Spark official guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes

The output sinks section of the official guide, below, says nothing about a DB sink, and writing to files is not supported in update mode either: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks

Currently I output to the console, but I would like to store the data in files or a DB.

So my question is: how can I write the stream data to a DB or a file in my situation? Do I have to write the data to Kafka and then use Kafka Connect to read it back into files/a DB?

P.S. I followed these articles to build the aggregated streaming query:

- https://stackguides.com/questions/62738727/how-to-deduplicate-and-keep-latest-based-on-timestamp-field-in-spark-structured
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
- will also try the following one more time using the Java API: https://stackguides.com/questions/50933606/spark-streaming-select-record-with-max-timestamp-for-each-id-in-dataframe-pysp
1
Can't you use JDBC to write to the DB?

    jdbcDF.write
      .format("jdbc")
      .option("url", "jdbc:postgresql:dbserver")
      .option("dbtable", "schema.tablename")
      .option("user", "username")
      .option("password", "password")
      .save()

– skjagini
What is the database flavor? – Alex Ott
@Alex Ott, it can be an in-memory one, H2. – soMuchToLearnAndShare

1 Answer

0
votes

I got confused by OUTPUT and WRITE. I was also wrongly assuming that a DB sink and a file sink are parallel terms in the output sinks section of the doc (and that is why one cannot see a DB sink in the output sinks section of the guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks).

I just realized that the output mode (append/update/complete) only constrains the streaming query itself. It has nothing to do with how the result is WRITTEN to the SINK. I also realized that writing to a DB can be achieved with the foreach sink (initially I had understood it as being only for extra transformations).
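As a rough sketch of the foreach sink (this needs a running Spark session and a reachable JDBC database; the table name, columns, and H2 connection string are hypothetical, not from the question):

    import java.sql.{Connection, DriverManager, PreparedStatement}
    import org.apache.spark.sql.{ForeachWriter, Row}

    // Hypothetical writer: inserts each row of the update-mode stream into a table.
    class JdbcSinkWriter(url: String, user: String, pass: String)
        extends ForeachWriter[Row] {
      var conn: Connection = _
      var stmt: PreparedStatement = _

      override def open(partitionId: Long, epochId: Long): Boolean = {
        conn = DriverManager.getConnection(url, user, pass)
        stmt = conn.prepareStatement("INSERT INTO results (id, value) VALUES (?, ?)")
        true // true = process this partition/epoch
      }

      override def process(row: Row): Unit = {
        stmt.setString(1, row.getAs[String]("id"))
        stmt.setString(2, row.getAs[String]("value"))
        stmt.executeUpdate()
      }

      override def close(errorOrNull: Throwable): Unit = {
        if (stmt != null) stmt.close()
        if (conn != null) conn.close()
      }
    }

    // Usage with an update-mode query (stateDf being the mapGroupsWithState result):
    // stateDf.writeStream
    //   .outputMode("update")
    //   .foreach(new JdbcSinkWriter("jdbc:h2:mem:test", "sa", ""))
    //   .start()

The foreach sink supports all output modes, which is what makes it usable here where the file sink (append-only) is not.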

I found several articles/discussions useful.

Later on, I read the official guide again and confirmed that foreachBatch can also run custom logic when WRITING to STORAGE.
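A minimal sketch of the foreachBatch route (again assuming a running Spark session; the H2 URL and table name are hypothetical). Inside foreachBatch each micro-batch is a plain DataFrame, so any batch writer (jdbc, parquet, csv, ...) can be used, even in update mode:

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SaveMode}
    import org.apache.spark.sql.streaming.StreamingQuery

    // Sketch: start an update-mode query whose sink is foreachBatch.
    def startJdbcQuery(stateDf: Dataset[Row], url: String): StreamingQuery =
      stateDf.writeStream
        .outputMode("update")
        .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
          batchDf.write
            .mode(SaveMode.Append)          // append each micro-batch to the table
            .format("jdbc")
            .option("url", url)             // e.g. "jdbc:h2:mem:test" (hypothetical)
            .option("dbtable", "results")   // hypothetical table name
            .option("user", "sa")
            .option("password", "")
            .save()
        }
        .start()

This avoids Kafka Connect entirely: the same batch JDBC write that skjagini suggested in the comments works per micro-batch.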