I'm making the case for switching our data forwarding job from Spark batch to Structured Streaming. We use a Kafka source and a foreach sink composed of socket connections.
With the batch job I tried to enforce exactly-once semantics by storing the offset in ZooKeeper on every ACK back from a socket, but under production throughput it has suffered outages a couple of times a week, most likely due to our offset management. I've now taken note of what a frequent poster, Jacek Laskowski, says about offset management:
You simply should not be dealing with this low-level "thing" called offsets that Spark Structured Streaming uses to offer exactly once guarantees.
I understand that since sockets are not idempotent, checkpointing to HDFS alone cannot guarantee exactly-once semantics. I've read that in Structured Streaming the offsets are checkpointed on every trigger, but during a trial run without checkpointing I was seeing triggers fire roughly every 25 ms.
Would Structured Streaming really be able to store the offset every 25 ms, and is this checkpoint period configurable from the Structured Streaming side? Keep in mind I have not installed HDFS on our Spark workers yet, so if it's a simple configuration on the HDFS side then I apologize for the long question :)
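
To make the question concrete, here is a minimal sketch of the kind of job I'm describing, written in PySpark for brevity. It assumes Spark 2.4+ (for the Python foreach sink) and the Kafka connector on the classpath. The names `SocketForeachWriter` and `start_forwarding`, all parameter names, and the default interval are hypothetical placeholders, not our production code. As far as I can tell, the checkpoint directory and the trigger interval are both set per query on the writer, which is what the sketch shows.

```python
import socket


class SocketForeachWriter:
    """Hypothetical foreach-sink writer that forwards each Kafka record over TCP."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.conn = None

    def open(self, partition_id, epoch_id):
        # Called on the executor before rows of a partition are processed.
        # Returning True tells Spark to go ahead and call process() for each row.
        self.conn = socket.create_connection((self.host, self.port))
        return True

    def process(self, row):
        # The Kafka source exposes the record payload as the binary `value` column.
        self.conn.sendall(row.value)

    def close(self, error):
        # Called after the partition is done; `error` is set if processing failed.
        if self.conn is not None:
            self.conn.close()
            self.conn = None


def start_forwarding(spark, brokers, topic, host, port, checkpoint_dir,
                     interval="5 seconds"):
    """Start a Kafka -> socket forwarding query (sketch under the assumptions above)."""
    records = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", topic)
        .load()
    )
    return (
        records.writeStream
        .foreach(SocketForeachWriter(host, port))
        # Where Spark persists source offsets and query progress.
        .option("checkpointLocation", checkpoint_dir)
        # Without an explicit trigger, a new micro-batch starts as soon as the
        # previous one finishes; this sets a fixed processing-time interval instead.
        .trigger(processingTime=interval)
        .start()
    )
```

Since the socket sink is not idempotent, I'd expect this to give at-least-once delivery at best: after a failure, records from the last uncommitted trigger could be sent again.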