I am trying to understand the following behavior of message loss in Kafka. Briefly, when a broker dies early on and subsequently after some message processing, all other brokers die. If the broker that died first starts up, then it does not catch up with other brokers after they come up. Instead all the other brokers report errors and reset their offset to match the first broker. Is this behavior expected and what are the changes/settings to ensure zero message loss?
Kafka version: 2.11-0.10.2.0
Reproducible steps
- Started 1 zookeeper instance and 3 kafka brokers
- Created one topic with replication factor of 3 and partition of 3
- Attached a kafka-console-consumer to topic
- Used Kafka-console-producer to produce 2 messages
- Killed two brokers (1&2)
- Sent two messages
- Killed last remaining broker (0)
- Bring up broker (1) who had not seen the last two messages
- Bring up broker (2) who had seen the last two messages and it shows an error
[2017-06-16 14:45:20,239] INFO Truncating log my-second-topic-1 to offset 1. (ka fka.log.Log) [2017-06-16 14:45:20,253] ERROR [ReplicaFetcherThread-0-1], Current offset 2 for partition [my-second-topic,1] out of range; reset offset to 1 (kafka.server.Rep licaFetcherThread)
- Finally connect kafka-console-consumer and it sees two messages instead of the four that were published.