
Our production environment ran into a Kafka event-consumption issue. The total event volume is 3.4 billion events across 40 partitions. The messages are distributed almost evenly, with 80+ million events per partition.

We allocate 40 consumer streams with 40 threads (BTW, we're using Kafka client 0.8.2).

During consumption, each partition's lag keeps dropping for the first 4 hours. In the last hour, 2/3 of the consumer streams have finished consuming, and fewer than 10 consumer streams are still receiving the remaining events. The lag on each of those remaining partitions is 2-3 million. As a result, the consumer pool gradually goes idle while it waits for the last few consumers to finish.

Assuming CPU cores, memory, and network bandwidth are sufficient, are there any tips to ensure that all Kafka consumers finish the overall consumption at about the same time, with no consumer falling behind (other than increasing the number of partitions)?

Did you have a look at what queuing theory says about your case? Is the long tail to be expected, or do you have a statistical artifact in your partitioning? – Harald
The long tail is not what we expected. Ideally, we want every consumer stream to finish at the same time, or within a small time gap of each other. – Brian Ling
What I mean by 'expecting' is: given that your partition distribution is uniform and your compute time per message follows some distribution (Normal, Poisson, or something else), the long tail may be exactly what must happen with high probability. – Harald
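One way to check Harald's question is a quick model of per-stream finish times. The sketch below is hypothetical and not from the thread. It assumes per-message processing times are i.i.d. with a chosen mean and standard deviation, so by the central limit theorem a stream's total time is roughly Normal(n·mean, √n·sd). It also adds an optional per-stream slowdown factor (`stream_rate_sd`) that stands in for systematic differences such as a slower broker or heavier messages on some partitions. The numbers used (0.2 ms per message, 85 million messages per stream) are assumptions chosen only to land in the same few-hours range as the question.

```python
import math
import random
import statistics


def finish_times(num_streams, msgs_per_stream, mean_ms, sd_ms,
                 stream_rate_sd=0.0, seed=0):
    """Approximate each stream's total processing time in hours.

    Assumes i.i.d. per-message times (mean_ms, sd_ms), so the sum over a
    stream is approximately Normal(n * mean, sqrt(n) * sd) by the CLT.
    stream_rate_sd is a hypothetical per-stream slowdown spread.
    """
    rng = random.Random(seed)
    times = []
    for _ in range(num_streams):
        total_ms = rng.gauss(msgs_per_stream * mean_ms,
                             math.sqrt(msgs_per_stream) * sd_ms)
        # Per-stream rate factor; exactly 1.0 when stream_rate_sd == 0.
        factor = max(0.1, rng.gauss(1.0, stream_rate_sd))
        times.append(total_ms * factor / 3_600_000)
    return times


def spread(times):
    """Gap between slowest and fastest stream, relative to the median."""
    return (max(times) - min(times)) / statistics.median(times)


iid = finish_times(40, 85_000_000, mean_ms=0.2, sd_ms=0.2)
skewed = finish_times(40, 85_000_000, mean_ms=0.2, sd_ms=0.2,
                      stream_rate_sd=0.1)
print(f"i.i.d. only:       spread = {spread(iid):.5f}")
print(f"with stream skew:  spread = {spread(skewed):.3f}")
```

Under these assumptions, i.i.d. per-message noise alone spreads the finish times by well under 1% of the run. A modest systematic per-stream rate difference, by contrast, produces a tail of the size described in the question. If the real data behaves like the first case, the tail probably points at something stream- or partition-specific rather than pure chance.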

1 Answer


This is a really cool one. I suspect there is no Kafka-side solution except the one you mention (increasing the number of partitions).

Hmm, a totally silly idea: during the end game, switch to copying the remaining messages into a new topic with fewer partitions, e.g. 37 (relatively prime to 40), in the hope of reshuffling them so that they are again evenly distributed over the 37 partitions. Of course, this assumes a lot: a) copying is faster than processing, b) there is really no way to increase your initial 40 partitions, and c) you have the space and resources for the additional topic.
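To see why a partition count relatively prime to 40 matters, here is a hypothetical sketch. It assumes keyed messages are assigned by a plain modulo partitioner (`partition = hash % num_partitions`). That is an assumption: the real partitioner depends on your producer and its configuration. Messages stuck on one old partition all share the same `hash % 40`. Re-partitioning them with `% 37` spreads them over all 37 new partitions, because gcd(40, 37) = 1. With `% 36` they would hit only 9 partitions, since gcd(40, 36) = 4 pins `hash % 4`. Integers stand in for key hashes here.

```python
from collections import Counter


def repartition(key_hashes, old_partitions, new_partitions, lagging):
    """Count where messages from the lagging old partitions land in the new topic.

    Hypothetical model: partition = hash % num_partitions for both topics.
    """
    moved = [h for h in key_hashes if h % old_partitions in lagging]
    return Counter(h % new_partitions for h in moved)


hashes = range(1_000_000)  # stand-in for non-negative key hashes
coprime = repartition(hashes, 40, 37, lagging={3})
shared_factor = repartition(hashes, 40, 36, lagging={3})
print(len(coprime), min(coprime.values()), max(coprime.values()))
print(len(shared_factor))
```

If the messages are unkeyed, the copier does not need hashing at all. It could simply spread the remaining messages round-robin over the new partitions.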