I am pretty new to Flink and about to launch our first production version. We have a stream of data, and a stateful filter checks whether each record is new.

  1. Would it be better to split the stream into separate jobs to gain more control over the parallelism, as shown in option 1, or is option 2 better? (image: split stream vs single)

  2. Following the documentation's recommendation, should I set a uid per operator? E.g.:
dataStream
.uid("firstid")
.keyBy(0)
.flatMap(flatMapFunction)
.uid("mappedId")
  3. Should I add rebalance after each uid, if at all?
  4. What is the difference between calling setMaxParallelism as described here and setting the parallelism from the Flink UI/CLI?

1 Answer

You only need to define .uid("someName") for your stateful operators. There is little need for it on operators that hold no state, since nothing in the savepoint has to be mapped back to them (more on this here). It won't hurt if you add it anyway, though.

rebalance will only help you in the presence of data skew, and only if you aren't using keyed streams. If you process data by key and your load isn't uniformly distributed across your keys (i.e. you have "hot" keys that carry most of the traffic), then rebalancing won't help you much.
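To make the point about hot keys concrete, here is a small plain-Java simulation (not Flink code). It assumes a made-up workload of 12 records, 8 of which share the key "hot", and 4 parallel subtasks. The class and method names are hypothetical, and the plain `hashCode()` partitioner is only a stand-in for whatever Flink does internally. Round-robin distribution, which is what rebalance does on a non-keyed stream, evens out the load, whereas keyed partitioning has to send every "hot" record to the same subtask.

```java
import java.util.Arrays;

public class SkewSketch {

    // Keyed partitioning: every record with the same key lands on the same subtask.
    // (Stand-in hash; Flink's real key assignment is more involved.)
    static int[] keyedLoad(String[] keys, int parallelism) {
        int[] load = new int[parallelism];
        for (String key : keys) {
            load[Math.floorMod(key.hashCode(), parallelism)]++;
        }
        return load;
    }

    // Round-robin distribution: records are dealt out to subtasks in turn,
    // regardless of their key.
    static int[] roundRobinLoad(String[] keys, int parallelism) {
        int[] load = new int[parallelism];
        for (int i = 0; i < keys.length; i++) {
            load[i % parallelism]++;
        }
        return load;
    }

    // Hypothetical workload: 8 records for one hot key, 1 record each for 4 other keys.
    static String[] sampleKeys() {
        String[] keys = new String[12];
        Arrays.fill(keys, 0, 8, "hot");
        keys[8] = "a";
        keys[9] = "b";
        keys[10] = "c";
        keys[11] = "d";
        return keys;
    }

    public static void main(String[] args) {
        String[] keys = sampleKeys();
        System.out.println("keyed:       " + Arrays.toString(keyedLoad(keys, 4)));
        System.out.println("round-robin: " + Arrays.toString(roundRobinLoad(keys, 4)));
    }
}
```

In the keyed case one subtask always receives at least the 8 "hot" records, no matter how many subtasks you add. Round-robin gives up the per-key guarantee, which is why it is not an option in front of an operator that relies on keyed state.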

In your example above I would start with Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general, stateless operators are very fast in Flink, so unless you want to add other consumers to the output of your stateful filter, don't bother splitting it up at this stage. There is no right or wrong here, though; it depends on your problem. Start simple and take it from there.

[Update] Re 4: setMaxParallelism, if I am not mistaken, defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. Flink uses this internally, but it doesn't set the parallelism of your job. You usually set it to some multiple of the actual parallelism you choose for your job (via -p <n> in the CLI/UI when you deploy it).
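A rough plain-Java sketch of how key groups relate the two settings, as far as I understand it. The class and method names are hypothetical; the key-to-key-group step uses a plain `hashCode()` as a stand-in (Flink applies an additional hash on top), while the key-group-to-subtask step mirrors, to my knowledge, the integer arithmetic Flink uses to hand each subtask a contiguous range of key groups.

```java
public class KeyGroupSketch {

    // Key -> key group. The number of key groups equals the max parallelism.
    // Stand-in hash only; Flink additionally scrambles key.hashCode().
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Key group -> subtask index for the parallelism the job currently runs with.
    // Each subtask owns a contiguous range of key groups.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // fixed for the lifetime of the keyed state
        for (int parallelism : new int[] {4, 8}) { // what -p <n> would set
            int keyGroup = keyGroupFor("some-key", maxParallelism);
            System.out.println("parallelism=" + parallelism
                    + " keyGroup=" + keyGroup
                    + " subtask=" + subtaskFor(keyGroup, maxParallelism, parallelism));
        }
    }
}
```

The key group of a record never changes when you rescale; only the mapping from key groups to subtasks does. That is why the parallelism can be changed between deployments, while the max parallelism caps how far you can scale out.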