I am new to kafka and spark structured streaming. I want to know how spark in batch mode knows which offset to read from? If I specify "startingOffsets" as "earliest", I am only getting the latest records and not all the records in the partition. I ran the same code in 2 different clusters. Cluster A ( local machine ) fetched 6 records, Cluster B ( TST cluster - very first run) fetched 1 record.
df = spark \
.read \
.format("kafka") \
.option("kafka.bootstrap.servers", broker) \
.option("subscribe", topic) \
.option("startingOffsets", "earliest") \
.option("endingOffsets", "latest" ) \
.load()
I am planning to run my batch once a day, will I get all the records from the yesterday to current run? Where do i see offsets and commits for batch queries?