Running the partition reassignment tool to spread the partitions over 5 brokers instead of 3. Kafka 2.1, on Docker.
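For reference, the reassignment was driven with the standard tooling, along the lines of the sketch below (the ZooKeeper address, file names and broker list are placeholders for illustration, not our exact values):

# Generate a candidate assignment spreading the topic over the 5 brokers
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1,2,3,4,6" \
  --generate

# Execute the proposed assignment saved from the step above
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json \
  --execute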
At some point, one of the nodes starts behaving erratically. The other (healthy) nodes begin to show these messages:
[2018-12-27 13:00:31,618] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=48303608, epoch=226826) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2018-12-27 13:00:31,620] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={impressions-35=(offset=3931626, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-26=(offset=4273048, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-86=(offset=3660830, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-93=(offset=2535787, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[26]), impressions-53=(offset=3683354, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-59=(offset=3696315, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-11=(offset=3928338, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-69=(offset=2510463, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-72=(offset=2481181, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-75=(offset=2462527, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-126=(offset=2510344, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-63=(offset=2515896, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=48303608, epoch=226826)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
at scala.Option.foreach(Option.scala:257)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
15 minutes later, the healthy servers show the following messages:
[2018-12-27 13:16:00,540] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition events-111 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)
Later on we can see lots of these messages:
[2018-12-27 17:20:21,132] WARN [ReplicaManager broker=1] While recording the replica LEO, the partition events-116 hasn't been created. (kafka.server.ReplicaManager)
interspersed with another, more common set of messages:
[2018-12-27 17:20:21,138] WARN [ReplicaManager broker=1] Leader 1 failed to record follower 3's position 2517140 since the replica is not recognized to be one of the assigned replicas 1,4,6 for partition events-53. Empty records will be returned for this partition. (kafka.server.ReplicaManager)
The reassigned topic had 128 partitions among 3 servers. All in all, each server has around 2000 partitions.
The reassignment has now been stuck for 6 hours, with the under-replicated partitions metric frozen at 41%. The topic had replication factor 3, although it now shows replication factor 5. I suppose this is part of how the rebalancing happens underneath, first adding the new replicas and only later dropping the unwanted ones?
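For reference, I have been checking on the progress roughly like this (the ZooKeeper address and file name are again placeholders):

# Check which partitions of the ongoing reassignment have completed or failed
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json \
  --verify

# List the partitions that are currently under-replicated
kafka-topics.sh --zookeeper zk:2181 \
  --describe --under-replicated-partitions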
Node 3, however, is showing these messages:
[2018-12-27 17:10:05,509] WARN [RequestSendThread controllerId=3] Controller 3 epoch 14 fails to send request (type=LeaderAndIsRequest, controllerId=3, controllerEpoch=14, partitionStates={events-125=PartitionState(controllerEpoch=14, leader=1, leaderEpoch=25, isr=3,1,2, zkVersion=57, replicas=1,6,2,3, isNew=false)}, liveLeaders=(172.31.10.35:9092 (id: 1 rack: eu-west-1c))) to broker 172.31.27.111:9092 (id: 3 rack: eu-west-1a). Reconnecting to broker. (kafka.controller.RequestSendThread)
So, something is wrong with node 3 - how can I find out what happened to it?
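The only things I have thought of checking so far are broker 3's own logs and its registration in ZooKeeper, roughly like this (ZooKeeper address is a placeholder):

# Is broker 3 still registered in ZooKeeper, and with which endpoints?
zookeeper-shell.sh zk:2181 get /brokers/ids/3

# Which broker is currently acting as controller?
zookeeper-shell.sh zk:2181 get /controller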
This has happened both times we tried reassigning partitions, on two topics with the same number of partitions. In the previous case we brought up another machine as a new broker with the same id (restarting the container didn't help) and it recovered. But how can this be prevented?
What could the root cause be?