7
votes

I looked around hard but didn't find a satisfactory answer to this. Maybe I'm missing something. Please help.

We have a Spark Streaming application consuming a Kafka topic that needs to ensure end-to-end processing (e.g. updating a database) before advancing the Kafka offsets. This is much like building transaction support into the streaming system: guaranteeing that each message is not only processed (transformed) but, more importantly, output.

I have read about Kafka direct streams. The documentation says that for robust failure recovery in direct-stream mode, Spark checkpointing should be enabled, which stores the offsets along with the checkpoints. But the offset management is done internally (by setting Kafka config params like ["auto.offset.reset", "enable.auto.commit", "auto.commit.interval.ms"]). It does not say how (or whether) we can customize when offsets are committed (e.g. once we have written to the database). In other words, can we set "enable.auto.commit" to false and manage the offsets (not unlike a DB connection) ourselves?
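To make this concrete, here is roughly the consumer configuration I have in mind (a sketch only; the broker address and group id are placeholders):

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.serialization.StringDeserializer;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker1:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "my-streaming-app");
kafkaParams.put("enable.auto.commit", false);  // disable auto-commit...
// ...and then advance the offsets ourselves only after the database write has succeeded.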

Any guidance/help is greatly appreciated.

3
Is there any Python implementation available for manual offset commits in PySpark? I am not able to find it anywhere. - Girish Gupta

3 Answers

1
votes

The article below could be a good starting point for understanding the approach.

spark-kafka-achieving-zero-data-loss

Furthermore,

The article suggests using the ZooKeeper client directly, which could also be replaced with something like Kafka's SimpleConsumer. The advantage of using ZooKeeper (or the SimpleConsumer) is that monitoring tools which depend on offsets saved in ZooKeeper keep working. The offsets can also be saved to HDFS or any other reliable storage.
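For illustration, here is a sketch of that pattern in Java with the 0-10 direct stream API (not the exact code from the article): loadOffsetsFromStore() and saveOffsetsToStore() are hypothetical helpers backed by whatever store you choose (ZooKeeper, HDFS, a database). The stream is started from the saved offsets, and offsets are persisted only after the batch's output has succeeded:

import java.util.Arrays;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

// Offsets previously saved by this application (hypothetical helper).
Map<TopicPartition, Long> fromOffsets = loadOffsetsFromStore();

// kafkaParams as usual (with "enable.auto.commit" set to false);
// jssc is an existing JavaStreamingContext.
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
            Arrays.asList("my-topic"), kafkaParams, fromOffsets));

stream.foreachRDD(rdd -> {
  OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

  // ... transform and write the batch to its destination ...

  saveOffsetsToStore(ranges);  // hypothetical helper: persist offsets only after the output succeeded
});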

0
votes

If you check your logs you will see:

2019-10-24 14:14:45 WARN  KafkaUtils:66 - overriding enable.auto.commit to false for executor
2019-10-24 14:14:45 WARN  KafkaUtils:66 - overriding auto.offset.reset to none for executor
2019-10-24 14:14:45 WARN  KafkaUtils:66 - overriding executor group.id to spark-executor-customer_pref_az_kafka_spout_stg_2
2019-10-24 14:14:45 WARN  KafkaUtils:66 - overriding receive.buffer.bytes to 65536 see KAFKA-3135

These properties are overridden by the Spark code.

To commit offsets manually, you can follow the Spark docs:

https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
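For context, a minimal stream setup in Java (a sketch; broker address, group id and topic are placeholders). The warnings above just mean that, whatever you pass on the driver, Spark forces enable.auto.commit to false and auto.offset.reset to none on the executors and prefixes the executor group.id with spark-executor-. The actual committing is then done explicitly with the commitAsync API described in the linked docs section:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

// Driver-side consumer config; values are placeholders.
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker1:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "my-streaming-app");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);  // commits are done explicitly, per the docs

// jssc is an existing JavaStreamingContext.
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(Arrays.asList("my-topic"), kafkaParams));

The object returned by createDirectStream is the one you later cast to CanCommitOffsets when committing.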

-1
votes

According to https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html#kafka-itself, enable.auto.commit must be set to false, and the offsets committed explicitly after your outputs have completed:

import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

// "stream" is the JavaInputDStream returned by KafkaUtils.createDirectStream
stream.foreachRDD(rdd -> {
  OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

  // some time later, after outputs have completed
  ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});
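Two caveats from the same docs page: the cast to CanCommitOffsets (like HasOffsetRanges) only succeeds on the object returned directly by createDirectStream, not after transformations, and commitAsync is asynchronous and not transactional, so for exactly-once results your output still needs to be idempotent, or you need to store the offsets atomically together with the results in your own data store.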