1
votes

I am trying to understand the following behavior of message loss in Kafka. Briefly, when a broker dies early on and subsequently after some message processing, all other brokers die. If the broker that died first starts up, then it does not catch up with other brokers after they come up. Instead all the other brokers report errors and reset their offset to match the first broker. Is this behavior expected and what are the changes/settings to ensure zero message loss?

Kafka version: 2.11-0.10.2.0

Reproducible steps

  • Started 1 zookeeper instance and 3 kafka brokers
  • Created one topic with replication factor of 3 and partition of 3
  • Attached a kafka-console-consumer to topic
  • Used Kafka-console-producer to produce 2 messages
  • Killed two brokers (1&2)
  • Sent two messages
  • Killed last remaining broker (0)
  • Bring up broker (1) who had not seen the last two messages
  • Bring up broker (2) who had seen the last two messages and it shows an error
[2017-06-16 14:45:20,239] INFO Truncating log my-second-topic-1 to offset 1. (ka
fka.log.Log)
[2017-06-16 14:45:20,253] ERROR [ReplicaFetcherThread-0-1], Current offset 2 for
 partition [my-second-topic,1] out of range; reset offset to 1 (kafka.server.Rep
licaFetcherThread)
  • Finally connect kafka-console-consumer and it sees two messages instead of the four that were published.
2

2 Answers

3
votes

Response here : https://kafka.apache.org/documentation/#producerconfigs

The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed:

  • acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.
  • acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.
  • acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.

By default acks=1 so set it to 'all' : acks=all in your producer.properties file

3
votes

Check if unclean.leader.election.enable = true and if so, set it to false so that only replicas that are insync can become the leader. If an out of sync replica is allowed to become leader then messages can get truncated and lost.