2 votes

I have an application that receives much of its input from a stream, but some of its data comes from both an RDBMS and a series of static files.

The stream will continuously emit events, so the Flink job will never end. How do you periodically refresh the RDBMS data and the static files to capture any updates to those sources?

I am currently using the JDBCInputFormat to read data from the database.
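For reference, the JDBC read currently looks roughly like this (a minimal sketch only; the driver, URL, table and column names are placeholders, not my real schema):

```java
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;

public class JdbcReadSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Shape of the rows returned by the query below (placeholder schema).
        RowTypeInfo rowTypeInfo = new RowTypeInfo(
                BasicTypeInfo.INT_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO);

        JDBCInputFormat jdbcInput = JDBCInputFormat.buildJDBCInputFormat()
                .setDrivername("org.postgresql.Driver")          // placeholder driver
                .setDBUrl("jdbc:postgresql://localhost:5432/db") // placeholder URL
                .setQuery("SELECT id, value FROM reference_data")
                .setRowTypeInfo(rowTypeInfo)
                .finish();

        // This is a bounded read: it scans the table once and finishes,
        // which is why the data goes stale while the job keeps running.
        DataStream<Row> referenceData = env.createInput(jdbcInput);
        referenceData.print();

        env.execute("jdbc-read-sketch");
    }
}
```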

Below is a rough schematic of what I am trying to do:

Flink Schematic

2

I don't really have a CDC process to pick the data up, and it isn't a massive amount of information. I am trying to write it in a map function with a local cache to see how that works. – mransley
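Roughly what I have in mind for the map function is sketched below (the JDBC details, column names, and the refresh interval are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class EnrichWithDbCache extends RichMapFunction<Integer, Tuple2<Integer, String>> {

    private static final long REFRESH_INTERVAL_MS = 6 * 60 * 60 * 1000L; // placeholder interval

    private transient Map<Integer, String> cache;
    private transient long lastRefresh;

    @Override
    public void open(Configuration parameters) throws Exception {
        cache = loadCache();
        lastRefresh = System.currentTimeMillis();
    }

    @Override
    public Tuple2<Integer, String> map(Integer key) throws Exception {
        // Lazily reload the cache once the refresh interval has elapsed.
        if (System.currentTimeMillis() - lastRefresh > REFRESH_INTERVAL_MS) {
            cache = loadCache();
            lastRefresh = System.currentTimeMillis();
        }
        return Tuple2.of(key, cache.getOrDefault(key, "unknown"));
    }

    // Re-reads the whole reference table; acceptable here because the data set is small.
    private Map<Integer, String> loadCache() throws Exception {
        Map<Integer, String> result = new HashMap<>();
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/db"); // placeholder
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, value FROM reference_data")) {           // placeholder
            while (rs.next()) {
                result.put(rs.getInt("id"), rs.getString("value"));
            }
        }
        return result;
    }
}
```

Note that each parallel instance of the operator keeps and reloads its own copy of the cache, so every refresh hits the database once per task.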

2 Answers

0 votes

For each of your two sources that might change (the RDBMS and the files), create a Flink source whose output is broadcast to the operators that process the data from Kafka. A broadcast stream sends every record to every parallel task/instance of the receiving operator.
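A rough sketch of that wiring, using Flink's broadcast state pattern (the reference source below just emits placeholder tuples on a timer; a real one would re-query the RDBMS and re-read the files):

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class BroadcastEnrichmentSketch {

    // Descriptor for the broadcast state holding the latest reference data.
    static final MapStateDescriptor<Integer, String> REF_DESCRIPTOR =
            new MapStateDescriptor<>("reference-data",
                    BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Main stream -- stands in for the Kafka source.
        DataStream<Integer> events = env.fromElements(1, 2, 3);

        // Reference-data source: re-emits the reference set periodically.
        DataStream<Tuple2<Integer, String>> refUpdates = env.addSource(
                new SourceFunction<Tuple2<Integer, String>>() {
                    private volatile boolean running = true;
                    @Override
                    public void run(SourceContext<Tuple2<Integer, String>> ctx) throws Exception {
                        while (running) {
                            // Placeholder: a real implementation would query the RDBMS
                            // and read the static files here.
                            ctx.collect(Tuple2.of(1, "reference-value-1"));
                            Thread.sleep(60_000); // refresh interval (placeholder)
                        }
                    }
                    @Override
                    public void cancel() { running = false; }
                });

        BroadcastStream<Tuple2<Integer, String>> refBroadcast = refUpdates.broadcast(REF_DESCRIPTOR);

        events.connect(refBroadcast)
              .process(new BroadcastProcessFunction<Integer, Tuple2<Integer, String>, String>() {
                  @Override
                  public void processElement(Integer key, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                      String ref = ctx.getBroadcastState(REF_DESCRIPTOR).get(key);
                      out.collect(key + " -> " + (ref == null ? "unknown" : ref));
                  }
                  @Override
                  public void processBroadcastElement(Tuple2<Integer, String> update, Context ctx, Collector<String> out) throws Exception {
                      // Every parallel instance receives every update and stores it.
                      ctx.getBroadcastState(REF_DESCRIPTOR).put(update.f0, update.f1);
                  }
              })
              .print();

        env.execute("broadcast-enrichment-sketch");
    }
}
```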

0 votes

For each of your sources, the files and the RDBMS, you can periodically (for example, every 6 hours) take a snapshot in HDFS or some other storage and compute the difference between two consecutive snapshots. The result is then pushed to Kafka. This solution works when you cannot modify the database or file structure to add extra information (for example, a column named last_update in the RDBMS).

Another solution is to add a column named last_update, use it to filter the rows that have changed between two queries, and push that data to Kafka.
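A rough sketch of the polling side of that second approach (the table, column, topic, and connection details are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LastUpdatePoller {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Timestamp lastSeen = new Timestamp(0L); // first run picks up everything

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/db"); // placeholder
                     PreparedStatement stmt = conn.prepareStatement(
                             "SELECT id, value, last_update FROM reference_data WHERE last_update > ?")) {
                    stmt.setTimestamp(1, lastSeen);
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {
                            // Push each changed row to Kafka; the Flink job consumes this topic.
                            producer.send(new ProducerRecord<>("reference-updates",
                                    rs.getString("id"), rs.getString("value")));
                            Timestamp updated = rs.getTimestamp("last_update");
                            if (updated.after(lastSeen)) {
                                lastSeen = updated;
                            }
                        }
                    }
                }
                Thread.sleep(6 * 60 * 60 * 1000L); // poll interval, matching the 6-hour example above
            }
        }
    }
}
```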