
Say I have a Flink application that filters, transforms, and processes a stream.

How can I break this application into two jobs that communicate with each other without using an intermediate store?

Refer to the image below for the dataflow.

Reason for the use case:

Raw event size: 2 KB; lite (transformed) event size: 200 B; throughput: 1M TPS.

Transformation is needed to make effective use of the Java heap and hold more events at any given time. Doing all three stages on a single TaskManager has the disadvantage of also keeping the raw ingested events on that heap, even though nearly 80% of them are not required (at 1M TPS, one second of raw 2 KB events is roughly 2 GB, versus roughly 40 MB for the ~20% that survive as 200 B lite events).

Running these stages as separate jobs on different TaskManagers would give much more flexibility in scaling the processing function.

I need help achieving this; any suggestion is welcome. I am also trying to understand how multiple jobs can be submitted via a single Flink application.

[Dataflow diagram: filter → transform → process]

Can you explain the motivation in more detail? I'm not seeing why any events would be stored in the task managers if they are simply being transformed and filtered. – David Anderson
M1: Figure out how to split the job so the TaskManagers incur less GC. M2: Learn how to deploy multiple Flink jobs in a single app, i.e., deploy in application mode. – ardhani

1 Answer


Several points:

Application mode, introduced in Flink 1.11, allows a single main() method to submit multiple jobs, but there's no mechanism for direct communication between these jobs. Flink's approach to fault tolerance via snapshotting doesn't extend to managing state in more than one job.
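
For illustration, here is a minimal sketch of that pattern; the sources, filter, and sinks below are stand-ins, not your actual pipeline:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoJobsInOneApplication {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Job 1: ingest + filter + transform. Source and logic are placeholders.
    env.fromSequence(0, 999)
        .filter(v -> v % 5 != 0)                      // drop the events that aren't needed
        .map(v -> "lite-" + v).returns(Types.STRING)  // 2 KB event -> 200 B lite event
        .print();                                     // stand-in sink
    env.executeAsync("filter-and-transform");         // submits job 1 and returns immediately

    // Job 2: the processing stage. In application mode the same main() may submit
    // further jobs (note: multi-execute applications don't support high availability).
    env.fromSequence(0, 999)                          // stand-in for the lite-event source
        .map(v -> v * 2).returns(Types.LONG)
        .print();
    env.executeAsync("process");
  }
}
```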

You could, hypothetically, connect the jobs with a socket sink and socket source. But you'll give up fault tolerance if you do this.
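
For completeness, here is what that hypothetical wiring could look like with the built-in socket connectors (the host and port are arbitrary). Note that writeToSocket() and socketTextStream() both act as clients, so something external must play the server role between them:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketHandoff {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Job 1 writes newline-delimited lite events to a socket. Since both ends are
    // clients, a server (a small relay, or `nc -lk 9999` for experiments) sits between.
    env.fromSequence(0, 999)
        .map(v -> "lite-" + v + "\n").returns(Types.STRING)  // newline-delimit records
        .writeToSocket("localhost", 9999, new SimpleStringSchema());
    env.executeAsync("socket-producer");

    // Job 2 reads the lite events back; records split on the default "\n" delimiter.
    env.socketTextStream("localhost", 9999)
        .print();
    env.executeAsync("socket-consumer");
  }
}
```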

You can achieve something similar to what you've asked for by configuring a slot sharing group that forces the final stage(s) of the pipeline into their own slot(s). However, this is almost certainly a bad idea, as it will force ser/de that might otherwise be unnecessary, and also result in poorer resource utilization. But it will separate those pipeline stages into another JVM.
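
A minimal sketch of that configuration, with placeholder operators standing in for the real stages:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingGroupExample {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.fromSequence(0, 999)
        .filter(v -> v % 5 != 0)                         // ingest/filter/transform stay in the "default" group
        .map(v -> "lite-" + v).returns(Types.STRING)
        .map(String::toUpperCase).returns(Types.STRING)  // stand-in for the heavy processing stage
        .slotSharingGroup("processing")                  // forces this stage (and downstream) into separate slots
        .print();                                        // crossing groups breaks chaining: ser/de + network hop

    env.execute("slot-sharing-example");
  }
}
```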

If the goal is to have separately deployable and independently scalable components, you can get that by using remote functions with the Stateful Functions API.
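
If you go that route, each remote function is deployed as its own service and scaled independently of the Flink cluster. A rough sketch using the StateFun Java SDK (the type name and logic here are made up, and the function's endpoint would still need to be registered in the module's module.yaml):

```java
import java.util.concurrent.CompletableFuture;
import org.apache.flink.statefun.sdk.java.Context;
import org.apache.flink.statefun.sdk.java.StatefulFunction;
import org.apache.flink.statefun.sdk.java.StatefulFunctionSpec;
import org.apache.flink.statefun.sdk.java.TypeName;
import org.apache.flink.statefun.sdk.java.message.Message;

// Hypothetical remote function implementing the "process" stage. It runs in its
// own service (own JVM/container), so it scales independently of the Flink cluster.
public class ProcessFn implements StatefulFunction {

  static final TypeName TYPE = TypeName.typeNameFromString("example/process");
  static final StatefulFunctionSpec SPEC =
      StatefulFunctionSpec.builder(TYPE).withSupplier(ProcessFn::new).build();

  @Override
  public CompletableFuture<Void> apply(Context context, Message message) {
    if (message.isUtf8String()) {
      String liteEvent = message.asUtf8String();
      // ... process the lite event here ...
    }
    return context.done();
  }
}
```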

To maximize performance (and minimize garbage collection) with the sort of ETL job you've shown, you're probably better off if you take advantage of operator chaining and object reuse, and keep everything in a single job.
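
Concretely, that single-job shape might look like the following sketch (pipeline details are placeholders):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainedEtl {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Let chained operators pass records by reference instead of copying them.
    // Safe only if the user functions don't cache or mutate their inputs.
    env.getConfig().enableObjectReuse();

    // Chaining is on by default: the filter, map, and sink below run in the
    // same task thread, with no serialization or network hop between stages.
    env.fromSequence(0, 999)
        .filter(v -> v % 5 != 0)
        .map(v -> "lite-" + v).returns(Types.STRING)
        .print();

    env.execute("chained-etl");
  }
}
```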