Spark Structured Streaming Batch Query

Question

I am new to kafka and spark structured streaming. I want to know how spark in batch mode knows which offset to read from? If I specify "startingOffsets" as "earliest", I am only getting the latest records and not all the records in the partition. I ran the same code in 2 different clusters. Cluster A ( local machine ) fetched 6 records, Cluster B ( TST cluster - very first run) fetched 1 record.

 df = spark \
     .read \
     .format("kafka") \
     .option("kafka.bootstrap.servers", broker) \
     .option("subscribe", topic) \
     .option("startingOffsets", "earliest") \
     .option("endingOffsets", "latest" ) \
     .load()

I am planning to run my batch once a day, will I get all the records from the yesterday to current run? Where do i see offsets and commits for batch queries?

Michael Heil Michael Heil · Accepted Answer · 2020-10-23T22:20:57

According to the Structured Streaming + Kafka Integration Guide your offsets are stored in the provided checkpoint location that you set in the write part of your batch job.

If you do not delete the checkpoint files, the job will continue to read from Kafka where it left off. If you delete the checkpoint files or if you run the job for the very first time the job will consume messages based on the option startingOffsets.

Spark Structured Streaming Batch Query

1 Answers