Running the partition reassignment tool to spread the partitions over 5 brokers instead of 3. Kafka 2.1, on Docker.
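For reference, the reassignment was driven with the standard tooling, along the lines of the sketch below (the ZooKeeper address, file names and broker list are placeholders for illustration, not our exact values):

# Generate a candidate assignment spreading the topic over the 5 brokers
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1,2,3,4,6" \
  --generate

# Execute the proposed assignment saved from the step above
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json \
  --execute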
At some point, one of the nodes starts behaving erratically. The other (healthy) nodes begin to show these messages:
[2018-12-27 13:00:31,618] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=48303608, epoch=226826) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2018-12-27 13:00:31,620] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={impressions-35=(offset=3931626, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-26=(offset=4273048, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-86=(offset=3660830, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-93=(offset=2535787, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[26]), impressions-53=(offset=3683354, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), impressions-59=(offset=3696315, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[29]), impressions-11=(offset=3928338, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-69=(offset=2510463, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-72=(offset=2481181, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[28]), events-75=(offset=2462527, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-126=(offset=2510344, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27]), events-63=(offset=2515896, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[27])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=48303608, epoch=226826)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
at scala.Option.foreach(Option.scala:257)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
15 minutes later, the healthy servers show the following messages:
[2018-12-27 13:16:00,540] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition events-111 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)
Later on we can see lots of these messages:
[2018-12-27 17:20:21,132] WARN [ReplicaManager broker=1] While recording the replica LEO, the partition events-116 hasn't been created. (kafka.server.ReplicaManager)
interspersed with another, more common set of messages:
[2018-12-27 17:20:21,138] WARN [ReplicaManager broker=1] Leader 1 failed to record follower 3's position 2517140 since the replica is not recognized to be one of the assigned replicas 1,4,6 for partition events-53. Empty records will be returned for this partition. (kafka.server.ReplicaManager)
The reassigned topic had 128 partitions among 3 servers. All in all, each server has around 2000 partitions.
The reassignment has now been stuck for 6 hours, with the under-replicated partitions metric frozen at 41%. The topic had replication factor 3, although it now shows replication factor 5. I suppose this is part of how the rebalancing happens underneath, first adding the new replicas and only later dropping the unwanted ones?
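For reference, I have been checking on the progress roughly like this (the ZooKeeper address and file name are again placeholders):

# Check which partitions of the ongoing reassignment have completed or failed
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassignment.json \
  --verify

# List the partitions that are currently under-replicated
kafka-topics.sh --zookeeper zk:2181 \
  --describe --under-replicated-partitions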
Node 3, however, is showing these messages:
[2018-12-27 17:10:05,509] WARN [RequestSendThread controllerId=3] Controller 3 epoch 14 fails to send request (type=LeaderAndIsRequest, controllerId=3, controllerEpoch=14, partitionStates={events-125=PartitionState(controllerEpoch=14, leader=1, leaderEpoch=25, isr=3,1,2, zkVersion=57, replicas=1,6,2,3, isNew=false)}, liveLeaders=(172.31.10.35:9092 (id: 1 rack: eu-west-1c))) to broker 172.31.27.111:9092 (id: 3 rack: eu-west-1a). Reconnecting to broker. (kafka.controller.RequestSendThread)
So, something is wrong with node 3 - how can I find out what happened to it?
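The only things I have thought of checking so far are broker 3's own logs and its registration in ZooKeeper, roughly like this (ZooKeeper address is a placeholder):

# Is broker 3 still registered in ZooKeeper, and with which endpoints?
zookeeper-shell.sh zk:2181 get /brokers/ids/3

# Which broker is currently acting as controller?
zookeeper-shell.sh zk:2181 get /controller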
This has happened both times we tried reassigning partitions, on two topics with the same number of partitions. In the previous case we brought up another machine as a new broker with the same id (restarting the container didn't help) and it recovered. But how can this be prevented?
What could the root cause be?