1
votes

I'm trying to understand the runtime aspect of Flink when dealing with multiple data streams and multiple operators per data stream.

Use case: N data streams in a single flink job (each data stream representing 1 device - with different time latencies), and each of these data streams gets split into two streams, of which one goes into a bunch of CEP operators, and one into a process function.

Questions:

  1. At runtime, will the engine create one thread per data stream? Or one thread per operator?
  2. Is it possible to dynamically create a data stream at runtime when the job starts? (i.e. if N is read from a file when the job starts and corresponding N streams need to be created)
  3. Are there any specific performance impacts when a large number of streams (N ~ 10000) are created, as opposed to N partitions within a single stream?
1

1 Answers

0
votes

Question # 1: At runtime, will the engine create one thread per data stream? Or one thread per operator?

In my opinion, there is no direct relation between streams and operators. you need to have a look at execution plans for getting the idea that how streams are connected to operators. Also it depends on parallelism, also there are various types of parallelisms in flink

Question # 2 : Is it possible to dynamically create a data stream at runtime when the job starts?

Yes, it is possible. the simple way is to create a stream generator which implements RichParallelSourceFunction function and send stream at some rate.

Question # 3: Are there any specific performance impacts when a large number of streams (N ~ 10000) are created, as opposed to N partitions within a single stream?

yes I think so, there will be some performance impacts, you can corroborate this by generating multiple streams and then comparing processing latency