Can I retrieve the latest available offset for a Kafka partition without retrieving all the messages?

Question

Looking at the latest (v0.10) Kafka Consumer documentation:

"The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives data calls poll(long) and receives messages."

Is there a way to query for the largest offset available for the partition on the server side, without retrieving all the messages?

The logic I am trying to implement is as follows:

query every second for the number (A) of pending messages in a topic
if A > threshold, wake up a processor that retrieves and processes all the messages
otherwise do nothing (sleep 1)

The motivation is that I need to do some batch processing, but I want the processor to wake up only when there is enough data (and I don't want to retrieve all the data twice).

Ewen Cheslack-Postava · Accepted Answer · 2016-07-31T21:49:20

You can use the Consumer.seekToEnd() method, then run Consumer.poll(0) so that the seek takes effect while the call returns immediately, and then call Consumer.position() for each subscribed (or assigned) topic partition. The positions returned are the current end offsets of those partitions. This also starts fetching some data from the brokers at those offsets, but any returned data is discarded if you subsequently seek back to a different position.
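Once you have the end positions, the pending count (A) from the question is the sum, over partitions, of the end position minus the position the processor had reached. Below is a minimal sketch of that arithmetic in plain Java, with the Kafka calls deliberately left out. The class `PendingMessages`, its methods, and the string partition labels are hypothetical names for illustration. It assumes you record Consumer.position() for each partition before calling seekToEnd(), record it again after seekToEnd() and poll(0), and then seek back to the recorded positions.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Kafka client API.
final class PendingMessages {
    private PendingMessages() {}

    /**
     * Sums (end position - current position) over all partitions.
     * Keys are partition labels such as "events-0". Both maps are assumed to be
     * filled from Consumer.position(): currentPositions before seekToEnd(),
     * endPositions after seekToEnd() followed by poll(0).
     */
    static long pending(Map<String, Long> endPositions, Map<String, Long> currentPositions) {
        long total = 0;
        for (Map.Entry<String, Long> entry : endPositions.entrySet()) {
            // Assumption: a partition with no recorded position counts from offset 0.
            long current = currentPositions.getOrDefault(entry.getKey(), 0L);
            // Never count a negative backlog if the recorded end position is stale.
            total += Math.max(0L, entry.getValue() - current);
        }
        return total;
    }

    /** Mirrors the question's rule: wake the processor only when A > threshold. */
    static boolean shouldWake(long pending, long threshold) {
        return pending > threshold;
    }
}
```

For example, with end positions of 150 and 80 and current positions of 100 and 10 on two partitions, `pending` returns 120, so a threshold of 100 would wake the processor. Treating a missing position as offset 0 is only a placeholder choice, since retention can move a partition's first available offset above 0.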

Currently, the alternative, as mentioned by serejja, is to use the old simple consumer, although that process is considerably more complicated because you need to find the leader for each partition manually.
