Is it safe to use Kafka and Spark Structured Streaming (SSS, >= v2.2) with checkpointing on HDFS in cases where I have to upgrade the Spark library or change the query? I'd like to continue seamlessly from the offsets left behind even in those cases.
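For concreteness, here's a minimal sketch of the kind of setup I mean (broker list, topic name, and paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("kafka-sss-checkpoint")
  .getOrCreate()

// Read from Kafka (broker and topic are placeholders)
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my-topic")
  .load()

// Write out, with offsets tracked in the HDFS checkpoint directory.
// The question is whether this checkpoint survives a Spark upgrade
// or a change to the query itself.
val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")
  .option("checkpointLocation", "hdfs:///checkpoints/my-query")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()
```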
Searching the net for compatibility issues in SSS's (>= 2.2) checkpoint mechanism turns up different answers. Maybe someone out there can shed some light on the situation, ideally backed up with facts/references or first-hand experience?
- Spark's programming guide (currently v2.3) just says the checkpoint location "...should be a directory in an HDFS-compatible ..." but doesn't say a word about compatibility constraints. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
- The Databricks docs at least hint that this is an issue: https://docs.databricks.com/spark/latest/structured-streaming/production.html#recover-after-changes-in-a-streaming-query
- A Cloudera blog recommends storing offsets in ZooKeeper instead, but this refers to the "old" Spark Streaming (DStream) implementation. Whether this applies to Structured Streaming as well is unclear. https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
- Someone in this conversation claims that there is no problem in that regard anymore, but without pointing to facts: How to get Kafka offsets for structured query for manual and reliable offset management? (If checkpoints turn out not to be upgrade-safe, manual offset management would be the fallback; see the sketch below.)
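In case manual offset management is the safer route, this is roughly what I have in mind: a StreamingQueryListener that records each batch's end offsets from the query progress to some external store. This is only a sketch; saveOffsets is a hypothetical hook, and the persistence target (ZooKeeper, a database, ...) is up for debate.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical persistence hook; replace with ZooKeeper, a DB, etc.
def saveOffsets(queryId: String, sourceOffsetsJson: String): Unit = {
  println(s"query=$queryId offsets=$sourceOffsetsJson")
}

val spark = SparkSession.builder().getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Each source reports its end offset as a JSON string, e.g. for Kafka:
    // {"my-topic":{"0":12345,"1":12346}}
    event.progress.sources.foreach { source =>
      saveOffsets(event.progress.id.toString, source.endOffset)
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
})
```

On a clean start with a fresh checkpoint directory, the saved offsets could then be fed back in via the Kafka source's startingOffsets option (which, as far as I know, is only honored when no checkpoint exists yet).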
Help is highly appreciated.