
I am trying to build a data management (DM) solution involving high-volume data ingestion: passing the data through domain rules, substitution (enrichment), and flagging erroneous data before sending it to a downstream system. The rule checking and value replacement can range from something simple, like permissible numeric thresholds that data elements must satisfy, to something more complex, like a lookup against master data for a domain pool of values.

Do you think Apache Flink would be a good candidate for such processing? Can Flink operators be defined to do a lookup (against the master data) for each tuple flowing through them? I see some drawbacks to using Apache Flink for the latter: 1) the lookup could be a blocking operation that slows down throughput, and 2) checkpointing and persisting operator state may not be possible if the operator functions have to fetch master data from elsewhere.

What are your thoughts? Is there some other tool better suited to this use case?

Thanks


1 Answer


The short answer is 'yes'. You can use Flink for all the things you mentioned, including data lookups and enrichment, with the caveat that you won't have at-most-once or exactly-once guarantees on side effects caused by your operators (like updating external state). You can work around the added latency of external lookups by giving that particular operator higher parallelism.
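As a rough illustration, here is a minimal sketch of a RichMapFunction that enriches each tuple via a master-data lookup and applies a simple threshold rule, with higher parallelism on that operator. The master-data store is stood in by an in-memory map loaded in open(); in a real job you would connect to your database, cache, or service there instead, and the record/field names are only placeholders for your own schema.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnrichmentSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (id, measured value) tuples standing in for the ingested records
        DataStream<Tuple2<String, Double>> events = env.fromElements(
                Tuple2.of("sensor-1", 42.0),
                Tuple2.of("sensor-2", 9999.0),   // violates the threshold rule
                Tuple2.of("unknown-id", 17.0));  // fails the master-data lookup

        // (id, enriched domain value, valid?) after rule checking and enrichment
        DataStream<Tuple3<String, String, Boolean>> enriched = events
                .map(new MasterDataLookup())
                .setParallelism(4); // several lookup instances running in parallel

        enriched.print();
        env.execute("enrichment-sketch");
    }

    /** Looks up master data for every tuple and applies a simple threshold rule. */
    public static class MasterDataLookup
            extends RichMapFunction<Tuple2<String, Double>, Tuple3<String, String, Boolean>> {

        private transient Map<String, String> masterData;

        @Override
        public void open(Configuration parameters) {
            // Stand-in for connecting to the real master-data store once per parallel instance.
            masterData = new HashMap<>();
            masterData.put("sensor-1", "domain-A");
            masterData.put("sensor-2", "domain-B");
        }

        @Override
        public Tuple3<String, String, Boolean> map(Tuple2<String, Double> event) {
            String domain = masterData.get(event.f0);             // enrichment lookup
            boolean valid = domain != null && event.f1 <= 1000.0; // permissible-threshold rule
            return Tuple3.of(event.f0, domain == null ? "UNKNOWN" : domain, valid);
        }
    }
}
```

Doing the connection setup (or data load) in open() keeps it off the per-record path; if the lookup latency still hurts at your volumes, adding a local cache or Flink's async I/O on top of this pattern is a common next step.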

It's impossible to give a precise answer without more information, such as what exactly constitutes 'high-volume data' in your case, what your per-event latency requirements are, what other constraints you have, etc. In general, though, before you commit to Flink you should take a look at both Spark Streaming and Apache Storm and compare. Both Spark and Storm have larger communities and more documentation, so they might save you some pain in the long run. Tags on StackOverflow at the time of writing: spark-streaming x 1746, apache-storm x 1720, apache-flink x 421.

More importantly, Spark Streaming has semantics similar to Flink's but will likely give you better bulk-data throughput. Storm, on the other hand, is conceptually similar to Flink (spouts/bolts vs. operators) and actually has lower performance/throughput in most cases, but it is a more established framework.