1 vote

I have my own partitioning technique that generates keys for DataStream tuples. The key range equals the number of nodes in the cluster: if I set the parallelism to 4, the generated keys are 0, 1, 2, and 3, and so on. Each key should then be routed to its own node, so that I can do further keyed processing using keyed state.

What happened: I implemented my logic using keyBy so that I could use keyed state, but it suffers from severe skew: some nodes received no records at all, while others received more than one key. I also tried custom partitioning, which gave me the physical partitioning I wanted, but I cannot use keyed state with it without going through keyBy.
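The skew is not random: Flink routes each key to a key group by hashing it, and a handful of small integer keys can easily hash into key groups that all belong to the same subtask. A rough, self-contained sketch of that assignment (assumptions: key group = murmurHash(key.hashCode()) % maxParallelism and subtask = keyGroup * parallelism / maxParallelism, with a MurmurHash3 approximation of Flink's MathUtils.murmurHash — the exact variant may differ by Flink version):

```java
public class KeyGroupDemo {
    // Approximation of Flink's MathUtils.murmurHash: MurmurHash3 (32-bit,
    // seed 0) over a single int. This is an assumption, not copied from Flink.
    static int murmurHash(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        code = Integer.rotateLeft(code, 13);
        code = code * 5 + 0xe6546b64;
        code ^= 4; // input length in bytes
        code ^= code >>> 16;
        code *= 0x85ebca6b;
        code ^= code >>> 13;
        code *= 0xc2b2ae35;
        code ^= code >>> 16;
        return code & 0x7fffffff; // keep it non-negative for the modulo below
    }

    // key -> key group -> subtask index, per Flink's documented assignment.
    static int subtaskFor(int key, int maxParallelism, int parallelism) {
        int keyGroup = murmurHash(Integer.hashCode(key)) % maxParallelism;
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        // Flink's default maxParallelism is 128 for small parallelism values.
        for (int key = 0; key < 4; key++) {
            System.out.println("key " + key + " -> subtask " + subtaskFor(key, 128, 4));
        }
    }
}
```

With only 4 keys spread over 128 key groups, there is no guarantee that the 4 keys land on 4 distinct subtasks, which is exactly the skew described above.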

Is there a way to control the skew, or to force the keys to be spread across the available nodes? Or is there a way to override the partitioning used by keyBy? Or a way to use keyed state with custom partitioning?


1 Answer

2 votes

As far as I know, there isn't a clean solution for situations like this, where the size of the keyspace is (roughly) equal to the parallelism. One brute-force approach that will work is to write your own KeySelector function and have it compute a surrogate key for each partition, chosen so that those keys belong to key groups that are assigned to distinct workers. Figuring out which keys map where is the non-straightforward part.
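That brute-force search can be sketched as follows: scan integer candidates until, for every subtask index, you have found one key that keyBy would route there, then have your KeySelector return the surrogate key for each logical partition instead of the partition id itself. This sketch reuses the assumed hash/assignment formulas from above (a MurmurHash3 approximation of Flink's MathUtils.murmurHash); in a real job, verify the mapping against your Flink version, e.g. with KeyGroupRangeAssignment.assignKeyToParallelOperator.

```java
import java.util.HashMap;
import java.util.Map;

public class SurrogateKeys {
    // Approximation of Flink's MathUtils.murmurHash (MurmurHash3, 32-bit,
    // seed 0, single int input) -- an assumption, may differ by version.
    static int murmurHash(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        code = Integer.rotateLeft(code, 13);
        code = code * 5 + 0xe6546b64;
        code ^= 4; // input length in bytes
        code ^= code >>> 16;
        code *= 0x85ebca6b;
        code ^= code >>> 13;
        code *= 0xc2b2ae35;
        code ^= code >>> 16;
        return code & 0x7fffffff; // non-negative
    }

    static int subtaskFor(int key, int maxParallelism, int parallelism) {
        int keyGroup = murmurHash(Integer.hashCode(key)) % maxParallelism;
        return keyGroup * parallelism / maxParallelism;
    }

    // For each subtask 0..parallelism-1, find the smallest non-negative int
    // key that keyBy would route to it. Returns subtask -> surrogate key.
    static Map<Integer, Integer> findSurrogates(int parallelism, int maxParallelism) {
        Map<Integer, Integer> surrogate = new HashMap<>();
        for (int key = 0; surrogate.size() < parallelism; key++) {
            surrogate.putIfAbsent(subtaskFor(key, maxParallelism, parallelism), key);
        }
        return surrogate;
    }

    public static void main(String[] args) {
        // parallelism 4, Flink's default maxParallelism 128
        System.out.println(findSurrogates(4, 128));
    }
}
```

Your KeySelector would then return surrogate.get(partitionId) instead of partitionId; keyed state still works, because from keyBy's point of view these are ordinary keys that just happen to land on distinct subtasks.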

There has been discussion about doing this on the user mailing list.