I am trying to build a data management (DM) solution that ingests high-volume data, passes it through data domain rules, performs substitution (enrichment), and flags erroneous data before sending it on to a downstream system. The rule checks and value replacements range from something simple, such as permissible numeric thresholds that the data elements must satisfy, to something more complex, such as a lookup against master data for the domain's pool of valid values.
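To make the two kinds of rules concrete, here is a minimal sketch in plain Java (no Flink involved). Every name in it is hypothetical: the `Reading` type, the 0 to 100 threshold, and the in-memory map and set standing in for master data are assumptions made up for illustration, not part of any real system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// All names, bounds, and lookup values here are hypothetical, for illustration only.
public class RuleCheckSketch {

    /** One incoming data element with a coded field and a numeric field. */
    static final class Reading {
        final String id;
        final String unitCode;
        final double value;

        Reading(String id, String unitCode, double value) {
            this.id = id;
            this.unitCode = unitCode;
            this.value = value;
        }
    }

    /** The (possibly enriched) reading plus any error flags raised by the rules. */
    static final class Checked {
        final Reading reading;
        final List<String> errors;

        Checked(Reading reading, List<String> errors) {
            this.reading = reading;
            this.errors = errors;
        }
    }

    // Simple rule: permissible numeric range (assumed bounds).
    static final double MIN_VALUE = 0.0;
    static final double MAX_VALUE = 100.0;

    // Stand-ins for master data: a substitution table and the domain pool of valid values.
    static final Map<String, String> UNIT_SUBSTITUTIONS = Map.of("degC", "C", "celsius", "C");
    static final Set<String> VALID_UNITS = Set.of("C", "F", "K");

    /** Applies substitution first, then flags domain and threshold violations. */
    static Checked check(Reading in) {
        List<String> errors = new ArrayList<>();

        // Enrichment: replace known aliases with the canonical code.
        String unit = UNIT_SUBSTITUTIONS.getOrDefault(in.unitCode, in.unitCode);

        // Complex rule: the value must belong to the master-data domain pool.
        if (!VALID_UNITS.contains(unit)) {
            errors.add("UNKNOWN_UNIT");
        }
        // Simple rule: the numeric value must lie within the permitted range.
        if (in.value < MIN_VALUE || in.value > MAX_VALUE) {
            errors.add("VALUE_OUT_OF_RANGE");
        }
        return new Checked(new Reading(in.id, unit, in.value), errors);
    }

    public static void main(String[] args) {
        Checked ok = check(new Reading("r1", "celsius", 42.0));
        Checked bad = check(new Reading("r2", "furlong", 250.0));
        System.out.println(ok.reading.unitCode + " " + ok.errors);
        System.out.println(bad.reading.unitCode + " " + bad.errors);
    }
}
```

Erroneous records are flagged rather than dropped, so the downstream system can decide what to do with them.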
Do you think Apache Flink is a good candidate for this kind of processing? Can Flink operators be defined that look up master data for each tuple flowing through them? I think there are some drawbacks to using Apache Flink for such lookups: 1) the lookup could be a blocking operation that slows down throughput, and 2) checkpointing and persisting the operator state cannot be done if the operator functions have to fetch master data from elsewhere.
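To illustrate the first concern, this is roughly the non-blocking behaviour I would want from a lookup step, sketched in plain Java with `CompletableFuture` rather than any Flink API. The `MASTER_DATA` map, the 50 ms delay simulating a remote call, and the pool size of 4 are all assumptions for the sake of the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical sketch: names, data, and timings are made up for illustration.
public class AsyncLookupSketch {

    // Stand-in for a remote master-data store.
    static final Map<String, String> MASTER_DATA = Map.of("DE", "Germany", "FR", "France");

    // Simulates a slow remote lookup on a separate thread pool, so the caller
    // is not blocked while the request is in flight.
    static CompletableFuture<String> lookupAsync(String code, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(50); // pretend network latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Codes missing from master data are flagged rather than dropped.
            return MASTER_DATA.getOrDefault(code, "UNKNOWN");
        }, pool);
    }

    // Issues every lookup first, then collects the results in input order,
    // so the lookups overlap instead of running one after another.
    static List<String> enrichAll(List<String> codes) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> pending = codes.stream()
                    .map(code -> lookupAsync(code, pool))
                    .collect(Collectors.toList());
            return pending.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(enrichAll(List.of("DE", "XX", "FR")));
    }
}
```

With a blocking lookup, each tuple would wait for the previous request to finish; here several requests are in flight at once. What I do not know is whether a Flink operator can work this way.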
What are your thoughts? Is there another tool better suited to this use case?
Thanks