We upgraded Kafka from 2.2 to 2.6 and added 4 new brokers to the existing 4. This went fine.

After that we started to reassign topic data to the new brokers. Most topics went fine, but the reassignment hangs on one of the 50 partitions of __consumer_offsets: 49 of the partitions were successfully moved from the old brokers (ids 3, 4, 5, 6) to the new ones (ids 10, 11, 12, 13).
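For reference, the reassignment of this one partition boils down to something like the following AdminClient sketch (we actually drove it with the standard reassignment tool and a JSON file; the bootstrap address is a placeholder and the target replica list is only illustrative of moving onto the new brokers):

import java.util.{Optional, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewPartitionReassignment}
import org.apache.kafka.common.TopicPartition

object ReassignOffsetsPartition {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092") // placeholder address
    val admin = AdminClient.create(props)
    try {
      // move __consumer_offsets-18 onto the new brokers (replica list is illustrative)
      val target = new NewPartitionReassignment(List(10, 11, 12, 13).map(Int.box).asJava)
      val reassignment = Map(
        new TopicPartition("__consumer_offsets", 18) -> Optional.of(target)
      ).asJava
      admin.alterPartitionReassignments(reassignment).all().get()
    } finally {
      admin.close()
    }
  }
}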

But on __consumer_offsets-18 we consistently get this error (in the server.log of the new brokers):

[2020-10-24 15:04:54,528] ERROR [ReplicaFetcher replicaId=10, leaderId=3, fetcherId=0] Unexpected error occurred while processing data for partition __consumer_offsets-18 at offset 1545264631 (kafka.server.ReplicaFetcherThread)
kafka.common.OffsetsOutOfOrderException: Out of order offsets found in append to __consumer_offsets-18: ArrayBuffer(1545264631, 1545264632, 
... thousands of other ids
1545272005, 1545272006, 1545272007)
    at kafka.log.Log.$anonfun$append$2(Log.scala:1126)
    at kafka.log.Log.append(Log.scala:2340)
    at kafka.log.Log.appendAsFollower(Log.scala:1036)
    at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:939)
    at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:946)
    at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:168)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:332)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:320)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:319)
    at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
    at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:319)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:135)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:134)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:117)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2020-10-24 15:04:54,534] INFO [ReplicaFetcher replicaId=10, leaderId=4, fetcherId=0] Truncating partition __consumer_offsets-31 to local high watermark 0 (kafka.server.ReplicaFetcherThread)
[2020-10-24 15:04:54,547] WARN [ReplicaFetcher replicaId=10, leaderId=3, fetcherId=0] Partition __consumer_offsets-18 marked as failed (kafka.server.ReplicaFetcherThread)

Any idea what is wrong here? The whole cluster seems to process data nicely; it's just that we can't move over this particular partition. We tried various things (cancelling the reassignment, restarting it with all partitions, restarting it with just partition 18, restarting all brokers) to no avail.
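In case it matters, cancelling the stuck reassignment and checking what is still in flight was done with the reassignment tool; programmatically it corresponds roughly to this AdminClient sketch (bootstrap address is a placeholder):

import java.util.{Optional, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewPartitionReassignment}
import org.apache.kafka.common.TopicPartition

object CancelStuckReassignment {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092") // placeholder address
    val admin = AdminClient.create(props)
    try {
      // show which reassignments are still in flight
      admin.listPartitionReassignments().reassignments().get().asScala.foreach {
        case (tp, r) => println(s"$tp adding=${r.addingReplicas()} removing=${r.removingReplicas()}")
      }
      // an empty Optional cancels the ongoing reassignment of that partition
      val cancel = Map(
        new TopicPartition("__consumer_offsets", 18) -> Optional.empty[NewPartitionReassignment]()
      ).asJava
      admin.alterPartitionReassignments(cancel).all().get()
    } finally {
      admin.close()
    }
  }
}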

Help would be very much appreciated; this is happening in PROD only, after it worked successfully on all test environments.

EDIT: We found that in the huge list of message offsets in the exception, there is a discrepancy. The relevant part of this list is:

... 1545271418, 1545271419, 1, 1, 1545271422, 1545271423,

Obviously the two '1' entries there look really wrong! They should be 1545271420/1545271421. Could it be that the leader really has some kind of data corruption?


1 Answer


We were ultimately able to solve this issue, mainly by sitting and waiting.

The issue was indeed that at some point, somehow, the data on the leader of this __consumer_offsets-18 partition got corrupted. This probably happened during the upgrade from Kafka 2.2 to 2.6. We did that upgrade in a rather dangerous way, as we know now: we simply stopped all brokers, updated the software on all of them, and restarted them. We thought that since we could afford this outage on the weekend, it would be a safe approach. But we will certainly never do that again, at least not unless we know 100% that all producers and consumers are really stopped. That was NOT the case during this upgrade: we overlooked one consumer and left it running, and that consumer group was storing its offsets in the __consumer_offsets-18 partition. Taking all brokers down and upgrading them while consumers/producers were still running probably caused the corruption.

Lesson learnt: never do that again; always do rolling upgrades, even if they take a lot longer.

The issue was then actually solved through the nature of the compacted topic. With the default settings, compaction only kicks in after a week (or when a segment grows bigger than 1 GB, which will not happen in our case). Compaction of that __consumer_offsets-18 partition kicked in yesterday evening, and with it the 2 corrupted offsets were purged away. After that it was smooth sailing: the reassignment of that partition to the new brokers worked like a charm.

We could certainly have sped up this recovery by setting the topic parameters so that compaction would kick in earlier. But it was only on Wednesday that we came to understand the problem could actually be resolved that way, so we decided to leave everything as it was and wait another day.
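For completeness, forcing compaction to kick in earlier would have meant lowering the relevant topic configs on __consumer_offsets, roughly like this AdminClient sketch (we did not actually run this; the values are illustrative and the bootstrap address is a placeholder):

import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object ForceEarlierCompaction {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092") // placeholder address
    val admin = AdminClient.create(props)
    try {
      val topic = new ConfigResource(ConfigResource.Type.TOPIC, "__consumer_offsets")
      // roll segments and compact dirty data within an hour instead of the default week
      // (illustrative values, to be reverted once the partition is clean again)
      val ops: java.util.Collection[AlterConfigOp] = List(
        new AlterConfigOp(new ConfigEntry("segment.ms", "3600000"), AlterConfigOp.OpType.SET),
        new AlterConfigOp(new ConfigEntry("max.compaction.lag.ms", "3600000"), AlterConfigOp.OpType.SET)
      ).asJava
      admin.incrementalAlterConfigs(Map(topic -> ops).asJava).all().get()
    } finally {
      admin.close()
    }
  }
}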