We upgraded Kafka from 2.2 to 2.6 and added 4 new brokers to the existing 4. This went fine.
After that we started to reassign topic data to the new brokers. Most topics went fine, but on one of the 50 partitions of __consumer_offsets the reassignment hangs: 49 of the partitions were successfully moved from the old brokers (ids 3,4,5,6) to the new ones (ids 10,11,12,13).
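For context, the moves were driven with the standard kafka-reassign-partitions.sh tool. A plan for just the stuck partition would look roughly like this (file name, bootstrap host and the exact replica list are placeholders here, the real plan covered all partitions):

# reassign-18.json (illustrative)
{"version":1,"partitions":[{"topic":"__consumer_offsets","partition":18,"replicas":[10,11,12,13]}]}

bin/kafka-reassign-partitions.sh --bootstrap-server broker10:9092 \
  --reassignment-json-file reassign-18.json --execute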
But on __consumer_offsets-18 we consistently get this error (in the server.log of the new brokers):
[2020-10-24 15:04:54,528] ERROR [ReplicaFetcher replicaId=10, leaderId=3, fetcherId=0] Unexpected error occurred while processing data for partition __consumer_offsets-18 at offset 1545264631 (kafka.server.ReplicaFetcherThread)
kafka.common.OffsetsOutOfOrderException: Out of order offsets found in append to __consumer_offsets-18: ArrayBuffer(1545264631, 1545264632,
... thousands of other ids
1545272005, 1545272006, 1545272007)
at kafka.log.Log.$anonfun$append$2(Log.scala:1126)
at kafka.log.Log.append(Log.scala:2340)
at kafka.log.Log.appendAsFollower(Log.scala:1036)
at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:939)
at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:946)
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:168)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:332)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:320)
at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:319)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:319)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:135)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:134)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:117)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2020-10-24 15:04:54,534] INFO [ReplicaFetcher replicaId=10, leaderId=4, fetcherId=0] Truncating partition __consumer_offsets-31 to local high watermark 0 (kafka.server.ReplicaFetcherThread)
[2020-10-24 15:04:54,547] WARN [ReplicaFetcher replicaId=10, leaderId=3, fetcherId=0] Partition __consumer_offsets-18 marked as failed (kafka.server.ReplicaFetcherThread)
Any idea what is wrong here? The whole cluster seems to process data nicely; it's just that we can't move over this particular partition. We have tried various things (cancelling the reassignment, restarting it with all partitions, restarting it with just partition 18, restarting all brokers), all to no avail.
Help is very much appreciated; this is happening only in PROD, after the same procedure worked successfully on all test environments.
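For reference, cancelling and re-checking the reassignment was done with the 2.6 tooling along these lines (sketch only; host and file names are placeholders):

# cancel the pending reassignment for the partitions listed in the plan file
bin/kafka-reassign-partitions.sh --bootstrap-server broker10:9092 \
  --reassignment-json-file reassign-18.json --cancel

# afterwards, check whether anything is still marked as in progress
bin/kafka-reassign-partitions.sh --bootstrap-server broker10:9092 \
  --reassignment-json-file reassign-18.json --verify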
EDIT: we found that there is actually a discrepancy in the huge list of message offsets in the exception. The relevant part of the list is:
... 1545271418, 1545271419, 1, 1, 1545271422, 1545271423,
Obviously the two '1' entries there look really wrong! They should be 1545271420/1545271421. Could it be that the leader really has some kind of data corruption?
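One way to verify that suspicion would be to dump the segment on the leader (broker 3) that contains the offsets around 1545271420 and inspect the raw batches there; roughly like this (the log dir and segment file name are guesses, it would be whichever segment covers that offset):

# on broker 3, in the log directory of __consumer_offsets-18
bin/kafka-dump-log.sh --deep-iteration \
  --files /var/kafka-logs/__consumer_offsets-18/<segment-base-offset>.log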