I have a use case where I read a set of key/value pairs, where the key is a plain String and the value is a JSON document. I have to expose these values as JSON to a REST endpoint, which I plan to do with a Kafka streaming consumer.
Now my questions are:
How do I deal with Kafka partitions? I'm planning to use Spark Streaming for the consumer; a rough sketch of what I have in mind is below.
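For reference, this is roughly what I'm picturing for the Spark Streaming side, assuming the spark-streaming-kafka-0-10 integration (the broker address, topic name and group id are made up, and the REST push is just a placeholder `println`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

object JsonConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("json-consumer")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",          // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "rest-endpoint-group",     // assumed group id
      "auto.offset.reset"  -> "latest"
    )

    // The direct stream creates one Spark partition per Kafka partition,
    // so partition handling is largely taken care of by the integration.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("kv-topic"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      rdd.foreach { record =>
        // push record.value (the JSON) to the REST layer here
        println(s"${record.key} -> ${record.value}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```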
What about the producer? I would like to poll data from an external service at a fixed interval and write the resulting key/value pairs to the Kafka topic. Is there such a thing as a streaming producer?
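My current thinking is that a plain `KafkaProducer` driven by a scheduled executor would cover this; something like the sketch below, where `fetchFromExternalService`, the topic name and the 30-second interval are all placeholders for my actual setup:

```scala
import java.util.Properties
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PollingProducer {
  // Placeholder for the actual call to the external service;
  // it would return a (key, jsonValue) pair.
  def fetchFromExternalService(): (String, String) =
    ("some-key", """{"field":"value"}""")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer  = new KafkaProducer[String, String](props)
    val scheduler = Executors.newSingleThreadScheduledExecutor()

    // Poll the external service every 30 seconds and publish the result.
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val (key, json) = fetchFromExternalService()
        producer.send(new ProducerRecord[String, String]("kv-topic", key, json))
      }
    }, 0, 30, TimeUnit.SECONDS)
  }
}
```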
Is this even a valid use case for Kafka? For example, I could have another consumer group that just logs the incoming key/value pairs to a database. This is exactly what attracts me to Kafka: the possibility of having multiple consumer groups doing different things with the same stream!
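For instance, the database-logging consumer could be as simple as the sketch below; the distinct `group.id` is the important part, since (as I understand it) each consumer group gets its own full copy of the topic. The broker address, topic name and `saveToDatabase` are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object DbLogger {
  // Placeholder for the actual persistence logic.
  def saveToDatabase(key: String, json: String): Unit =
    println(s"INSERT $key -> $json")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
    // A distinct group.id means this consumer reads its own full copy of the
    // topic, independently of the REST-facing consumer group.
    props.put("group.id", "db-logger-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("kv-topic"))

    while (true) {
      val records = consumer.poll(100L)  // poll timeout in milliseconds
      for (record <- records.asScala)
        saveToDatabase(record.key, record.value)
    }
  }
}
```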
I suppose partitioning a topic exists to increase parallelism and thereby consumer throughput. How does that throughput compare with a single-partition topic? I have a use case where I have to ensure ordering, so I cannot simply fan records out across partitions, but at the same time I would like very high consumer throughput. How do I go about this? See the sketch below for what I mean.
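From what I've read, Kafka only guarantees ordering within a single partition, not across a topic. So if ordering only needs to hold per key (say, per entity), I could keep many partitions and still preserve it, because the default partitioner hashes the record key so that all records with the same key land on the same partition. A sketch of what I mean, reusing the producer from above (`entityId` is a hypothetical ordering key):

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// With the default partitioner, the partition is chosen by hashing the key,
// so every record for a given key goes to the same partition and stays in
// order there, while different keys spread across partitions for parallelism.
def sendInOrder(producer: KafkaProducer[String, String],
                entityId: String, json: String): Unit =
  producer.send(new ProducerRecord[String, String]("kv-topic", entityId, json))
```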
Any suggestions?