My Spark Streaming application fetches data from Kafka and processes it.
If the application fails, a huge amount of data accumulates in Kafka, and at the next start-up of the Spark Streaming application it crashes because too much data is consumed at once. Since my application does not care about past data, it is perfectly fine to consume only the current (latest) data.
I found "auto.reset.offest" option and it behaves little different in Spark. It deletes the offsets stored in zookeeper, if it is configured. Despite however, its unexpected behavior, it is supposed to fetch data from the latest after deletion.
But I found it's not. I saw all the offsets are cleaned up before consuming the data. Then, because of default behavior, it should fetch the data as expected. But it still crashes due to too much data.
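For reference, here is a minimal sketch of how I pass the option when creating the stream with KafkaUtils.createStream (the ZooKeeper quorum, group id, and topic name below are placeholders, not my actual values, and my real processing is more involved):

    import kafka.serializer.StringDecoder

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaLatestExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaLatestExample")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Placeholder connection settings, not my real values.
        val kafkaParams = Map[String, String](
          "zookeeper.connect" -> "zkhost:2181",
          "group.id"          -> "my-group",
          // For Kafka 0.8 the valid values are "smallest" and "largest";
          // "largest" is supposed to start from the latest offset.
          "auto.offset.reset" -> "largest"
        )

        // Placeholder topic name and receiver thread count.
        val topics = Map("my-topic" -> 1)

        val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER_2)

        // Just count the messages in each batch, for illustration.
        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }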
When I clean up the offsets, consume data from the latest offset with kafka-console-consumer, and then run my application, it works as expected.
So it looks like "auto.offset.reset" does not work, and the Kafka consumer in Spark Streaming fetches data from the "smallest" offset by default.
Do you have any idea how to consume Kafka data from the latest offset in Spark Streaming?
I am using spark-1.0.0 and Kafka-2.10-0.8.1.
Thanks in advance.