1
votes

Is it possible in Apache Flink, to create an application, which consists of multiple jobs who build a pipeline to process some data.

For example, consider a process with an input/preprocessing stage, a business logic and an output stage. In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.

Is it possible in Flink to built this and directly pipe the output of one job to the input of another (without external components)? If yes, where can I find documentation about this and can it buffer data if one of the jobs is restarted? If no, does anyone have experience with such a setup and point me to a possible solution?

Thank you!

2

2 Answers

1
votes

If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.

Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent different in performance, if the upstream job is generating data faster than the downstream job can consume it.

I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.

0
votes

An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.

I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...

You can use a broadcast stream to communicate/update the dynamic portions of the processing.

Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.