In Spark Structured streaming with Kafka, how spark manages offset for multiple topics

Question

Consider a Spark Structured Streaming job that reads the messages from the Kafka.

In case we have defined multiple topics, how does code manages offset for each topic?

I have been going through KafkaMicroBatchStream class and not able to get how if get's offset for different topics.

The def latestOffset(start: Offset, readLimit: ReadLimit): Offset; method will return only one offset.

Trying to understand the implementation as I need to write my custom source that reads from multiple RDBMs tables and each table would have it's own offset. The offset would be manages in RDBMS table only.

@OneCricketeer Just trying to understand, how in case of multiple topics the offset is managed by Spark-Kafka integration. — JDev
Well, that's going to depend if you store the offsets in checkpoints, back in Kafka, in Zookeeper, or elsewhere, but in general (or how Kafka does it on its own) is that each topic-partition is stored for the entire consumer group — OneCricketeer

mike mike · Accepted Answer · 2020-12-07T09:06:46

When a Structured Streaming job fetches data from a Kafka source, the offsets are typically stored in a checkpoint file. In those files you will find the latest processed offsets per TopicPartition (based on the Consumer Group created by the Structured Streaming job). The term "TopicPartition" means that the offsets are stored per Topic per Partition.

This checkpointing is applicable for Kafka topics as source because the offsets are a unique identifier which does never change during the life-time of the message.

When reading from an RDBMs you would need to track each row that has been already consumed by the streaming job, e.g. by tracking the primary keys. However, you need to think about updates on rows already consumed.

I assume that is the reason, why there are (not yet) Structured Streaming sources for RDBMs available, as mentioned in the Structured Streaming Programming Guide on Input Source:

File source
Kafka source
Socket source (for testing)
Rate source (for testing)

In Spark Structured streaming with Kafka, how spark manages offset for multiple topics

1 Answers