
I am using spark structured streaming (2.2.1) to consume a topic from Kafka (0.10).

 val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", fromKafkaServers)
      .option("subscribe", topicName)
      .option("startingOffset", "earliest")
      .load()

My checkpoint location is set to an external HDFS directory. In some cases, I would like to restart the streaming application and consume data from the beginning. However, even though I delete all the checkpoint data from the HDFS directory and resubmit the jar, Spark still finds my last consumed offset and resumes from there. Where else do the offsets live? I suspect it is related to the Kafka consumer id. However, I am unable to set group.id with Spark Structured Streaming per the Spark docs, and it seems like all applications subscribing to the same topic get assigned to one consumer group. What if I would like to have two independent streaming jobs running that subscribe to the same topic?


1 Answer


You have a typo :) It's startingOffsets
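With the misspelled option, the Kafka source never sees your setting and falls back to its default (`latest` for streaming queries). A corrected version of the reader from the question (the server and topic variables are the question's own placeholders):

    // The option is "startingOffsets" (plural). The misspelled
    // "startingOffset" is silently ignored, so the source defaults
    // to "latest" instead of reading from the beginning.
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", fromKafkaServers)
      .option("subscribe", topicName)
      .option("startingOffsets", "earliest") // note the trailing "s"
      .load()

Also note that, per the Spark Kafka integration guide, `startingOffsets` only applies when a query starts fresh; once a checkpoint exists, the query resumes from the checkpointed offsets regardless of this option, so deleting the checkpoint directory is still required to re-read from the beginning.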