6
votes

I have a partitioned topic, which has X partitions.

Currently, when producing messages, I create Kafka's ProducerRecord specifying only the topic and value; I do not define a key. As far as I understand, my messages will be distributed evenly among partitions by the default built-in partitioner. On the consumer side, I have a thread pool of Kafka consumers. Each consumer runs in its own dedicated thread and consumes messages from the topic. All of them are given the same group.id, which allows messages to be consumed in parallel: every consumer is assigned its fair share of partitions to read from.

I want my messages to be consumed in order. I know that Kafka guarantees the order of messages within a partition. So, as long as I come up with a proper key structure, related messages will end up in the same partition. In effect, the message key groups messages and determines which partition they are stored in.

Does it make sense?

Q: Is there a chance that a badly designed key will give me uneven partitions, where one receives far more records than the others? Can that hurt the performance of my Kafka cluster? What are the best practices for message key design?


2 Answers

7
votes

Your understanding of the default partitioner is correct.

If you have no requirement to consume messages in the same order as they were produced, then not specifying a key is the best option. If you do, the requirement itself tells you what your key must be. For instance, if you want to preserve the order of produced messages for a given user, user_id is a natural candidate for the message key.

To achieve a particular message order, you also need to think about how your producers are configured. If a producer can retry sending a message after a failure and the number of in-flight requests (max.in.flight.requests.per.connection) is greater than 1, messages can be written out of order.
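As a minimal sketch of the configuration concern above, the class below builds producer settings that keep retries enabled but allow only one in-flight request per connection. The class name, method name, retry count, and broker address are hypothetical; only the configuration keys are Kafka's. In recent client versions, setting enable.idempotence=true is an alternative that is documented to preserve ordering with up to 5 in-flight requests, so check the documentation for the client version you run.

```java
import java.util.Properties;

public class OrderedProducerConfig {

    /** Builds producer settings intended to keep per-partition order across retries. */
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Conservative choice: at most one unacknowledged request per connection,
        // so a retried batch cannot be overtaken by a later one.
        props.put("max.in.flight.requests.per.connection", "1");
        // Retry count is an arbitrary illustrative value.
        props.put("retries", "3");
        props.put("acks", "all");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address; these Properties would be passed to a KafkaProducer.
        Properties props = build("localhost:9092");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Limiting in-flight requests to 1 trades some throughput for ordering, which is why the idempotent producer is usually worth evaluating first.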

You can get uneven partitions by choosing a bad key. For example, if 90% of your users are from New York and 10% are from other cities, and you choose the city as the key, then one of your partitions will be huge and one of your consumers will be overloaded (assuming the number of messages per user is the same).
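The city example can be illustrated with a small simulation. This is only a sketch: it uses String.hashCode as a stand-in for Kafka's real hash, and the class name, city names, and record counts are made up. The point holds for any deterministic hash, because all records with the same key land in the same partition.

```java
import java.util.HashMap;
import java.util.Map;

public class KeySkewDemo {

    /** Stand-in for the partitioner: a deterministic hash of the key, modulo the partition count. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Counts records per partition, given how many records each key produces. */
    static int[] distribute(Map<String, Integer> recordsPerKey, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (Map.Entry<String, Integer> e : recordsPerKey.entrySet()) {
            counts[partitionFor(e.getKey(), numPartitions)] += e.getValue();
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> byCity = new HashMap<>();
        byCity.put("New York", 9000); // 90% of all records share one key
        byCity.put("Boston", 400);
        byCity.put("Chicago", 300);
        byCity.put("Denver", 300);

        int[] counts = distribute(byCity, 6);
        for (int p = 0; p < counts.length; p++) {
            System.out.println("partition " + p + ": " + counts[p]);
        }
    }
}
```

Whichever partition "New York" hashes to receives at least 9000 of the 10000 records, no matter how many partitions the topic has. A higher-cardinality key such as user_id spreads the load much more evenly.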

2
votes

Kafka applies a murmur2 hash to the serialized key and takes the result modulo the number of partitions, i.e. roughly toPositive(murmur2(keyBytes)) % numPartitions. In all likelihood, keys should be evenly distributed with default partitioning. I would suggest experimenting with all your key options using a simple murmur2 function written in Java to see the distribution pattern, and then making a choice. Also, there are two implementations of default partitioning in Kafka: the murmur hash implementation is in the newer versions, and old legacy versions work differently.
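A sketch of the experiment suggested above might look like the following. The murmur2 function is modeled on the one in Kafka's Java client (Utils.murmur2) but is reproduced here from memory, so treat it as an assumption and compare it against the source of the client version you use. The class name, the "user-N" key format, and the partition count are hypothetical, and keys are assumed to be serialized as UTF-8 strings.

```java
import java.nio.charset.StandardCharsets;

public class Murmur2Distribution {

    /** 32-bit murmur2 hash, modeled on the Kafka Java client's implementation. */
    static int murmur2(byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;

        // Mix four bytes at a time, read as a little-endian int.
        for (int i = 0; i < length / 4; i++) {
            int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }

        // Handle the trailing 1 to 3 bytes; the cases fall through on purpose.
        int tail = length & ~3;
        switch (length % 4) {
            case 3: h ^= (data[tail + 2] & 0xff) << 16;
            case 2: h ^= (data[tail + 1] & 0xff) << 8;
            case 1: h ^= data[tail] & 0xff;
                    h *= m;
        }

        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    /** Clears the sign bit, then takes the hash modulo the partition count. */
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        return (murmur2(bytes) & 0x7fffffff) % numPartitions;
    }

    /** Counts how many of the given keys fall into each partition. */
    static int[] histogram(String[] keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int numPartitions = 6; // hypothetical partition count
        String[] keys = new String[10_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "user-" + i; // hypothetical key format
        }
        int[] counts = histogram(keys, numPartitions);
        for (int p = 0; p < counts.length; p++) {
            System.out.println("partition " + p + ": " + counts[p]);
        }
    }
}
```

Feeding in a sample of real candidate keys instead of the synthetic ones shows how many distinct keys each partition would get. To estimate record volume per partition, weight each key by how many messages it produces.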