We are attempting to design an ETL solution using Spring Cloud Data Flow (SCDF).
The main requirements are:
- Query an external source database (read-only access) whose result set can be as large as ~400k rows
- Perform minimal transformation / data-quality checks
- Sink the data into a PostgreSQL data mart using the JDBC sink (truncating the target table before each daily load)
- Run this every 24 hours
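For context, the pipeline we have in mind looks roughly like the following stream definition. This is only a sketch: the table, schema, and column names are hypothetical, and the exact property names (`jdbc.query`, `trigger.fixed-delay`, `jdbc.table-name`, etc.) vary across starter-app releases, so check the documentation of the app versions you deploy.

```
# Hypothetical SCDF shell commands; property names depend on the starter-app release.
stream create daily-etl --definition "jdbc \
    --jdbc.query='SELECT id, name, updated_at FROM src_schema.src_table' \
    --trigger.fixed-delay=24 --trigger.time-unit=HOURS \
  | transform --expression='payload' \
  | jdbc --jdbc.table-name=mart_table" --deploy
```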
Some challenges we faced:
We tried using the JDBC source starter app to connect to the source database. However, that database is owned by another department and we only have read-only access, so the `jdbc.update` feature, which updates the source table to mark a row as "seen", cannot be used.

Is there a recommended way to query and page through a large result set without running out of memory? The `jdbc.max-rows-per-poll` option does not seem suitable in this situation.
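One workaround we are considering, since the source is read-only, is keyset pagination: repeatedly select `WHERE key > last_seen ORDER BY key LIMIT n` and keep the cursor on the consumer side instead of writing a "seen" flag back to the source. A minimal sketch of the idea, with an in-memory list standing in for the source table (all class and method names here are hypothetical, not part of any SCDF starter app):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class KeysetPagingDemo {
    // In-memory stand-in for the source table (~400k rows shrunk to 10 for the demo).
    static final List<Integer> TABLE =
            IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());

    // Stand-in for: SELECT ... FROM src WHERE id > ? ORDER BY id LIMIT ?
    static List<Integer> fetchPage(int lastSeenId, int pageSize) {
        return TABLE.stream()
                .filter(id -> id > lastSeenId)
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    // Drain the whole table one page at a time; the cursor lives on our side,
    // so nothing is ever written to the read-only source database.
    static List<Integer> drainAll(int pageSize) {
        List<Integer> out = new ArrayList<>();
        int lastSeenId = 0;
        while (true) {
            List<Integer> page = fetchPage(lastSeenId, pageSize);
            if (page.isEmpty()) break;
            out.addAll(page);
            lastSeenId = page.get(page.size() - 1); // advance the keyset cursor
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drainAll(4).size() + " rows drained");
    }
}
```

Alternatively, for a single streaming pass, setting the JDBC driver's fetch size (e.g. via `Statement.setFetchSize`) lets the driver stream rows rather than buffer the entire result set in memory, though whether that actually streams depends on the source database's driver.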
We would prefer not to use Tasks / Spring Batch, as the Streams pipeline is more powerful and cleaner (data flows through the stream and is transformed along the way). Has anyone successfully used SCDF streams for a similar goal, or is it simply not designed for this use case?