I have a Spark Structured Streaming job that needs to read data from a Kafka topic and do some aggregation. The job has to restart daily, but on restart, if I set startingOffsets="latest" I lose the data that arrives while the job is down, and if I set startingOffsets="earliest" the job re-reads the whole topic from the beginning rather than from where the last run left off. How can I configure the job so it resumes from exactly the offset where the previous run stopped?
I'm using Spark 2.4.0 and Kafka 2.1.1. I have tried setting a checkpoint location for the write query, but it seems like Spark doesn't pick up the Kafka offsets from it: it keeps starting from the latest or earliest offset depending on startingOffsets.
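Here is roughly what the write side looks like; it consumes the df defined in the read config below, and the windowed aggregation, sink path, and checkpoint path are simplified placeholders rather than my exact job:

import org.apache.spark.sql.functions._

// Placeholder aggregation: a windowed count with a watermark. The file sink
// only supports append mode, and append on an aggregation needs the watermark.
val counts = df
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

val query = counts.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/data/output")                        // placeholder path
  .option("checkpointLocation", "/data/checkpoints/job") // the checkpoint I set
  .start()

query.awaitTermination()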
Here is the config I use to read from Kafka:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("subscribe", topic)
  .option("startingOffsets", offset)     // "latest" or "earliest"
  .option("enable.auto.commit", "false") // has no effect: the Kafka source manages offsets itself and never commits to Kafka
  .load()
For example, say the topic has 10 messages with offsets 1 to 10, Spark has just finished processing message 5, and then the job restarts. How can I make Spark continue from message 6 (the next unprocessed message) rather than from 1 or 11?
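From the Spark Kafka integration docs I know startingOffsets also accepts an explicit per-partition JSON, but as I understand it that only applies when the query starts without a checkpoint, and it would mean tracking the offsets myself. A sketch, with "my-topic" and offset 6 as placeholders for the example above:

// Explicit-offset form of startingOffsets: start partition 0 at offset 6,
// i.e. right after message 5. Per the docs, -2 means earliest and -1 means
// latest, and this option is ignored once the query resumes from a checkpoint.
val dfExplicit = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("subscribe", "my-topic")
  .option("startingOffsets", """{"my-topic":{"0":6}}""")
  .load()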
Comment: set checkpointLocation to an HDFS-compatible location. – Gowtham