Data Modeling with Kafka? Topics and Partitions

Question

One of the first things I think about when using a new service (such as a non-RDBMS data store or a message queue) is: "How should I structure my data?".

I've read and watched some introductory materials. For example, Kafka: a Distributed Messaging System for Log Processing says:

"a Topic is the container with which messages are associated"
"the smallest unit of parallelism is the partition of a topic. This implies that all messages that ... belong to a particular partition of a topic will be consumed by a consumer in a consumer group."

Knowing this, what would be a good example that illustrates how to use topics and partitions? When should something be a topic? When should something be a partition?

As an example, let's say my (Clojure) data looks like:

{:user-id 101 :viewed "/page1.html" :at #inst "2013-04-12T23:20:50.22Z"}
{:user-id 102 :viewed "/page2.html" :at #inst "2013-04-12T23:20:55.50Z"}

Should the topic be based on user-id? viewed? at? What about the partition?

How do I decide?

Strange this talks about topics and partitions, but not necessarily evolution of the data within them. What if you wanted to attach user agents or headers to those "user view" events? How do you evolve and communicate that in a way to downstream consumers? — OneCricketeer
@OneCricketeer Sounds like a separate question to me :) Go for it... — David J.

Lundahl · Accepted Answer · 2013-06-20T13:57:03

How you structure your data for Kafka really depends on how it's meant to be consumed.

In my mind, a topic is a grouping of messages of a similar type that will be consumed by the same type of consumer. In the example above, I would just have a single topic. If you later decide to push some other kind of data through Kafka, you can add a new topic for it then.

Topics are registered in ZooKeeper, which means you might run into issues if you try to add too many of them, for example if you have a million users and decide to create a topic per user.

Partitions, on the other hand, are a way to parallelize consumption of the messages. For partitioning to be useful, the total number of partitions in the broker cluster needs to be at least the number of consumers in a consumer group. The consumers in a group split the work of processing the topic among themselves according to the partitioning, so each consumer is only concerned with messages in the partitions it is assigned to.
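
To make the relationship between partitions and consumers concrete, here is a minimal sketch in plain Python. It does not use Kafka and it is not Kafka's actual assignment algorithm; `assign_partitions` and the consumer names are hypothetical, and a simple round-robin split is assumed purely for illustration.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids across consumers round-robin.

    Illustrative model only: real Kafka assignment strategies differ.
    """
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment


# 4 partitions, 2 consumers: each consumer handles two partitions.
print(assign_partitions(4, ["c1", "c2"]))
# -> {'c1': [0, 2], 'c2': [1, 3]}

# 2 partitions, 3 consumers: one consumer gets nothing to do.
print(assign_partitions(2, ["c1", "c2", "c3"]))
# -> {'c1': [0], 'c2': [1], 'c3': []}
```

The second call shows why the partition count should be at least the consumer count: with fewer partitions than consumers, the extra consumer sits idle.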

The partition can either be set explicitly on the producer side using a partition key or, if no key is provided, a random partition is selected for every message.
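
As a rough sketch of what a partition key does, the Python below maps the `user-id` from the question's example to a partition with a stable hash. It does not use a Kafka client; `partition_for`, the choice of CRC32, the partition count of 4, and the third event are all assumptions made for illustration, not how any particular Kafka producer hashes keys.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition id using a stable hash (CRC32, assumed here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4  # assumed partition count for the single "page views" topic

events = [
    {"user-id": 101, "viewed": "/page1.html"},
    {"user-id": 102, "viewed": "/page2.html"},
    {"user-id": 101, "viewed": "/page3.html"},  # hypothetical extra event
]

for event in events:
    partition = partition_for(event["user-id"], NUM_PARTITIONS)
    print(event["user-id"], event["viewed"], "-> partition", partition)
```

With a key like this, every event for a given user lands in the same partition, so a single consumer in the group sees all of that user's page views. Without a key, the same user's events would be scattered across partitions.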
