0
votes

Say a consumer does a time intensive processing. In order to scale consumer side processing, i would like to spawn multiple consumers and consumer messages from kafka topic in a round robin fashion. Based on the documentation, it seems like if i create multiple consumers and add them in one consumer group, only one consumer will get the messages. If i add consumers to different consumer groups, each consumer will get the same message. So, in order to achieve the above objective, is the only solution to partition the topic ? This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design. Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer. Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.

Second if i choose "cleanup.policy" to compact, does it mean that kafka log will keep increasing as it will maintain the latest value for each key? If not, how can i get log deletion and compaction?

UPDATE: It seems like i have two options to achieve scalability on consumer side, which are independent of topic scaling.

  1. Create consumer groups and have them consume odd and even offsets. This logic would have to be built into the consumers to discard un-needed messages. Also doubles the network requirements

  2. Create a hierarchy of topics, where the root topic gets all the messages. Then some job classifies the logs and publish them again to more fine grained topics. In this case, the strong ordering can be achieved at root and more fine grained topics for consumer scaling can be constructed.

In 0.8, kafka maintains the consumer offset, so publishing messages in a round robin across various consumers is not a too far fetched requirement from their design.

3

3 Answers

2
votes

Partitions are the unit of parallelism in Kafka by design. Not just for consumtion but kafka distributes the partiotions accross cluster which has different other benifits like sharing load among different servers, replication management for ensuring no Data loss, managing log to scale beyond a size that will fit on a single server etc.

Ordering of messages is a key factor as if you do not need a storng ordering then diving topics with multiple partitions will allow you to evenly distribute the load while producing (this will be handled by the producer itself). And while using consumer group you just need to add more consumer instances in the same group in order to consume them parallely.

Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.

True,from the doc

However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.

Maintaining ordering whiile consuming in distributed manner requires the messaging system to maintain per-message state to keep track of message acknowledgement. But this will involve a lot of expensive random I/O in the system. So clearly there is a trade-off.

Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer

Distributing messages across partitions is typically handled by the producer it self without any intervention from the programmers end (assuming you don't want to categories messages using key). And for the consumers as you just mentioned here the better choice would be to use Simple/Low level consumers which will allow you to consume only a subset of the partitions in a topic.

This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design

I believe for a system like Kafka which focuses on high throughput ( handle hundreds of megabytes of reads and writes per second from thousands of clients ), ensuring scalability and strong durability and fault-tolerance guarantees might not be a good fit for someone having totally a different business requirements.

0
votes

Topic partitioning is primarily a way to scale out consumers and brokers so if you need many consumers to keep up then you need to partition the topic and add multiple consumer instances in the same consumer group. The producer API will manage partitions transparently. If you need to have certain consumers subscribing only to some partitions, then you need to use the simple consumer API instead of the high level API and in this case you don't have the consumer group concept and have to coordinate consumption yourself.

Message ordering is guaranteed within partitions but not between partitions so if this is a requirement it needs to be dealt with on consumer side.

Setting cleanup.policy=compact means that the Kafka brokers will keep the latest version of a message key indefinitely and use cases like that should be more for recording of data updates for things you intend to keep around rather than the log stream buffering use case.

0
votes

You need to factor out the reading of Kafka messages from the subsequent processing of those messages. You can use partitions and consumer groups to make reading messages as fast as possible, but if you process the messages as part of your consumer logic then you'll just slow down your consumers. By streaming the messages from consumers to other classes that will perform your processing you can adjust the parallelism of the consumers and of the processors independently. You'll see this approach in technologies like Spark and Storm.

This approach does add one complication and that is that the consumer has to commit the message offset before the message has been processed. You may have to track the messages in flight to insure execute-exactly-once.