How often is the checkpoint period for structured streaming, and is it configurable?

Question

I'm in the process of making the case for our data forwarding job to switch from Spark batch to Structured Streaming. We use a Kafka source and a foreach sink composed of socket connections.

With batch streaming, I tried to enforce exactly-once semantics by storing the offset in ZooKeeper on every ACK back from a socket, but the job has suffered outages under production throughput a couple of times a week, most likely due to our offset management. I've now taken note of advice from a frequent poster, Jacek Laskowski, regarding offset management:

You simply should not be dealing with this low-level "thing" called offsets that Spark Structured Streaming uses to offer exactly once guarantees.

I understand that since sockets are not idempotent, we cannot guarantee exactly-once semantics through HDFS checkpointing. I've read that for Structured Streaming the offset will be checkpointed on every trigger, but during a trial run without checkpointing I was seeing a trigger roughly every 25 ms.

Would Structured Streaming really be able to store the offset every 25 ms, and is this checkpoint period configurable from a Structured Streaming perspective? Keep in mind that I have not installed HDFS on our Spark workers yet, so if it's a simple configuration on the HDFS side, then I apologize for the long question :)

Akhil Bojedla · Accepted Answer · 2018-03-12T16:22:36

Checkpointing happens per trigger, so you control its frequency by configuring the trigger interval, as below:

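import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// resultTable is the streaming DataFrame; writer is a ForeachWriter
val query = resultTable
  .writeStream
  .outputMode(OutputMode.Update())
  .option("checkpointLocation", "hdfs://path/to/checkpoints")
  // Start a micro-batch every 10 seconds
  .trigger(Trigger.ProcessingTime(10.seconds))
  .foreach(writer)
  .start()

query.awaitTermination()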
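
The `writer` passed to `.foreach` is not defined in the snippet above. As a rough sketch only, it might look something like the following for a socket sink. The class name `SocketWriter`, the `host`/`port` parameters, and the comma-joined line format are all assumptions made for illustration, not something from the question. Note that this by itself does not make a socket sink exactly-once.

```scala
import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical sink: host and port are placeholders for the real endpoint.
class SocketWriter(host: String, port: Int) extends ForeachWriter[Row] {
  // Created in open() on the executor, so never serialized with the writer.
  @transient private var socket: Socket = _
  @transient private var out: PrintWriter = _

  // Called for each partition of each trigger. Returning false tells Spark
  // to skip process() for that partition.
  override def open(partitionId: Long, version: Long): Boolean = {
    socket = new Socket(host, port)
    out = new PrintWriter(socket.getOutputStream, true) // autoflush on println
    true
  }

  // Assumed wire format: one comma-joined line per row.
  override def process(row: Row): Unit = out.println(row.mkString(","))

  // Called after the partition is done, with the error if one occurred.
  override def close(errorOrNull: Throwable): Unit = {
    if (out != null) out.close()
    if (socket != null) socket.close()
  }
}
```

With something like this in place, `val writer = new SocketWriter("localhost", 9999)` (again, placeholder values) could be passed to `.foreach(writer)` in the query above.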
