
I have just started working with Kafka and am playing with it on my local machine. I am able to produce and consume messages using the Python Kafka client provided by Confluent.

My understanding so far:

  1. The Kafka clients (i.e. both consumers and producers) maintain a queue of messages.
  2. The producer stores every produced message in a local buffer queue, and it needs to explicitly push messages from the local buffer to the Kafka cluster.
  3. On the consumer's end, it somehow fetches messages from the Kafka cluster, stores them in a local buffer queue, and the application then pulls these messages from the buffer via API calls such as poll().
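To check my understanding, here is a toy sketch of that mental model using only Python's standard library (no Kafka involved; `cluster` here is just a stand-in list, not a real broker):

```python
from collections import deque

# Toy stand-ins, NOT real Kafka: the "cluster" is just a list.
cluster = []

# 1./2. Producer keeps a local buffer queue, then pushes to the cluster.
producer_buffer = deque()
producer_buffer.append(b"message-1")   # produce() -> local buffer
producer_buffer.append(b"message-2")
while producer_buffer:                 # "push": send buffered messages out
    cluster.append(producer_buffer.popleft())

# 3. Consumer fetches from the cluster into a local buffer, then poll()s it.
consumer_buffer = deque(cluster)       # background fetch (simplified)

def poll():
    return consumer_buffer.popleft() if consumer_buffer else None

print(poll())  # b'message-1'
print(poll())  # b'message-2'
```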

First of all, am I missing something here?

Also, I frequently come across the phrase "committing messages/offsets" for both clients. What exactly does committing messages/offsets to Kafka mean?


1 Answer


Producers will store every produced message in a local buffer queue. And producers need to explicitly push messages from the local buffer to the Kafka cluster.

You can configure how big that batch is, the acknowledgment level, etc., but you do not need to "explicitly push" after calling producer.produce(...).
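For example, batching behavior is controlled with settings like these (a sketch, not a recommendation; the property names are standard confluent-kafka/librdkafka settings, and the values shown are only illustrative):

```python
# Illustrative confluent-kafka producer settings (values are examples only).
producer_conf = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "batch.size": 16384,   # max bytes per batch sent to a partition
    "linger.ms": 5,        # wait up to 5 ms to fill a batch before sending
    "acks": "all",         # require acknowledgment from all in-sync replicas
}
# The client sends these batches in the background;
# no explicit "push" is needed after produce().
```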

On Consumer's end, it will somehow fetch messages from kafka cluster and store it in a local buffer queue and then pull these messages from buffer via api calls such as poll()

You are correct.

First of all, am I missing something here?

No. These concepts are important to understand for tuning and performance, and you should take some care in how you configure your producers and consumers, but they should not impact how you code them (e.g. don't do any tricks in timing your consumer.poll()).

On the producing side, there is a notion of transactions if you want to back out of publishing a message (this exists to support two-phase commits), but I am not sure how this works with the Python API. Beyond that, you should consider a message sent as soon as you call producer.produce(). You should not need to call producer.flush() after every producer.produce().
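A common pattern instead of flushing after every produce is to attach a delivery-report callback (a sketch; the Producer lines are commented out because they require a running broker, and the topic/server names are just placeholders):

```python
# Hypothetical delivery report callback for confluent-kafka's Producer.
# produce() is asynchronous: this callback fires later, when the client
# serves events during poll() or flush().
def delivery_report(err, msg):
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

# Usage (requires a running broker and the confluent_kafka package):
# from confluent_kafka import Producer
# p = Producer({"bootstrap.servers": "localhost:9092"})
# p.produce("my-topic", b"hello", callback=delivery_report)
# p.poll(0)   # serve delivery callbacks for already-acknowledged messages
# p.flush()   # at shutdown only: block until all buffered messages are sent
```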

On the consuming side, there is the notion of commits and offsets. At a high level you can think of it this way: when you poll() you are asking for messages from the API but, for ease of understanding, you are asking for messages from the topic. As you work your way through the messages in the topic, you may want some way to let Kafka know that you have already read and processed some of them and would not like to see them again.
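Whether that happens automatically or explicitly is a consumer configuration choice (a sketch; the property names are standard confluent-kafka/librdkafka settings, and the group/server values are placeholders):

```python
# Illustrative confluent-kafka consumer settings (values are examples only).
consumer_conf = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "group.id": "my-group",            # offsets are stored per consumer group
    "enable.auto.commit": False,       # commit offsets explicitly instead
    "auto.offset.reset": "earliest",   # where to start with no committed offset
}
# With auto-commit disabled, you call consumer.commit() yourself after
# processing, so a crash mid-processing does not silently skip messages.
```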

Each message has an offset in the topic (this is actually per partition, but we can keep it simple and say topic). You can think of this offset as the position of the message in the topic (first message 0, second message 1, and so on; each message added has a higher offset than the last). When you explicitly call commit, you are actually committing the offset, telling Kafka that you have read all of the messages up to that point and don't want to see them again.
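The bookkeeping can be illustrated with a toy model in plain Python (no Kafka; `committed` is just an integer stand-in for what the broker stores per consumer group, and the committed value is the *next* offset to read):

```python
# Toy model of a topic (really: one partition) and a committed offset.
topic = ["m0", "m1", "m2", "m3"]   # offsets 0, 1, 2, 3
committed = 0                      # next offset to read (stored broker-side)

def poll_batch(max_messages=2):
    """Return the next (offset, message) pairs from the committed offset."""
    return list(enumerate(topic[committed:committed + max_messages],
                          start=committed))

def commit(offset):
    """Tell 'Kafka' that everything before `offset` has been processed."""
    global committed
    committed = offset

batch = poll_batch()        # [(0, 'm0'), (1, 'm1')]
commit(batch[-1][0] + 1)    # commit offset 2: don't show m0/m1 again
print(poll_batch())         # [(2, 'm2'), (3, 'm3')]
```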