When I start my Spark Structured Streaming 3.0.1 application from the latest offset, it works well. But when I start it from an earlier offset, for example:
- startingOffsets set to "earliest"
- startingOffsets set to a particular offset, like {"MyTopic-v1":{"0":1686734237}}
I can see in the logs that the starting offset gets picked up correctly, but then a series of seeks happens (starting from my defined position) until it reaches the current latest offset.
I dropped my checkpoint directory and tried several options, but the scenario is always the same: it reports the correct starting offset, but then takes a very long time to slowly seek to the most recent offset before it starts processing. Any idea why, and what else I should check?
2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786734237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786734737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786735237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786735737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786736237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786736737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786737237 for partition MyTopic-v1-0
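To quantify the pattern in the log lines above, here is a minimal Python sketch that extracts the seek offsets and computes the step between consecutive seeks per partition. It is not part of my application; the helper name seek_steps and the regular expression are made up for illustration and assume the log format shown above.

```python
import re

# Matches the "Seeking to offset <n> for partition <name>" part of a KafkaConsumer log line.
SEEK_RE = re.compile(r"Seeking to offset (\d+) for partition (\S+)")

def seek_steps(log_lines):
    """Return, per partition, the differences between consecutive seek offsets."""
    last_offset = {}
    steps = {}
    for line in log_lines:
        match = SEEK_RE.search(line)
        if not match:
            continue  # ignore lines that are not seek messages
        offset, partition = int(match.group(1)), match.group(2)
        if partition in last_offset:
            steps.setdefault(partition, []).append(offset - last_offset[partition])
        last_offset[partition] = offset
    return steps

# First three log lines from above, copied verbatim.
sample = [
    "2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786734237 for partition MyTopic-v1-0",
    "2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786734737 for partition MyTopic-v1-0",
    "2021-02-19 14:52:23 INFO KafkaConsumer:1564 - [...] Seeking to offset 1786735237 for partition MyTopic-v1-0",
]

print(seek_steps(sample))  # {'MyTopic-v1-0': [500, 500]}
```

So each seek advances the position by exactly 500 offsets, which is why catching up to the latest offset takes so many seeks.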
I left the application running for a longer time and it eventually started producing files, but my processing trigger of 100 seconds was not met: the data showed up much later, after 20-30 minutes.
(I also tested it on Spark 2.4.5 and saw the same problem. Maybe it's some Kafka configuration?)
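For reference, the two startingOffsets variants mentioned above would be passed to the Kafka source roughly as in this minimal sketch. It is an illustration, not my actual code: the helper kafka_source_options and the broker address localhost:9092 are placeholders, and the Spark call is shown only in a comment because it needs a running SparkSession.

```python
import json

def kafka_source_options(topic, starting_offsets, bootstrap_servers="localhost:9092"):
    """Build the option map for Spark's Kafka source.

    starting_offsets is either the string "earliest"/"latest" or a dict of
    {topic: {partition: offset}}, which the Kafka source expects as a JSON string.
    bootstrap_servers is a placeholder default, not a real cluster address.
    """
    if isinstance(starting_offsets, dict):
        starting_offsets = json.dumps(starting_offsets)
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }

# Variant 1: start from the earliest available offset.
earliest_opts = kafka_source_options("MyTopic-v1", "earliest")

# Variant 2: start from a specific offset on partition 0.
specific_opts = kafka_source_options("MyTopic-v1", {"MyTopic-v1": {"0": 1686734237}})

# With an existing SparkSession named `spark`, the options would be applied as:
# df = spark.readStream.format("kafka").options(**specific_opts).load()
```

Note that startingOffsets only applies when a query starts without an existing checkpoint, which is why I dropped the checkpoint directory between attempts.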