
We have a Kafka cluster with 3 brokers (version 1.1.0) that had been running well for over 6 months.

After 2018/12/12 we increased the partition count from 3 to 48 for every topic, and since then the brokers have been shutting down every 5-10 days.

We then upgraded the brokers from 1.1.0 to 2.1.0, but they still shut down every 5-10 days.

Each time, one broker shuts down after the error log below. Several minutes later the other 2 brokers shut down too, with the same error but for different partition log files.

[2019-01-11 17:16:36,572] INFO [ProducerStateManager partition=__transaction_state-11] Writing producer snapshot at offset 807760 (kafka.log.ProducerStateManager)
[2019-01-11 17:16:36,572] INFO [Log partition=__transaction_state-11, dir=/kafka/logs] Rolled new log segment at offset 807760 in 4 ms. (kafka.log.Log)
[2019-01-11 17:16:46,150] WARN Resetting first dirty offset of __transaction_state-35 to log start offset 194404 since the checkpointed offset 194345 is invalid. (kafka.log.LogCleanerManager$)
[2019-01-11 17:16:46,239] ERROR Failed to clean up log for __transaction_state-11 in dir /kafka/logs due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.NoSuchFileException: /kafka/logs/__transaction_state-11/00000000000000807727.log
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
        at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
        at java.nio.file.Files.move(Files.java:1395)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:809)
        at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:222)
        at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:488)
        at kafka.log.Log.asyncDeleteSegment(Log.scala:1838)
        at kafka.log.Log.$anonfun$replaceSegments$6(Log.scala:1901)
        at kafka.log.Log.$anonfun$replaceSegments$6$adapted(Log.scala:1896)
        at scala.collection.immutable.List.foreach(List.scala:388)
        at kafka.log.Log.replaceSegments(Log.scala:1896)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:583)
        at kafka.log.Cleaner.$anonfun$doClean$6(LogCleaner.scala:515)
        at kafka.log.Cleaner.$anonfun$doClean$6$adapted(LogCleaner.scala:514)
        at scala.collection.immutable.List.foreach(List.scala:388)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:514)
        at kafka.log.Cleaner.clean(LogCleaner.scala:492)
        at kafka.log.LogCleaner$CleanerThread.cleanLog(LogCleaner.scala:353)
        at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:319)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:300)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
        Suppressed: java.nio.file.NoSuchFileException: /kafka/logs/__transaction_state-11/00000000000000807727.log -> /kafka/logs/__transaction_state-11/00000000000000807727.log.deleted
                at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
                at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
                at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
                at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
                at java.nio.file.Files.move(Files.java:1395)
                at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:806)
                ... 17 more
[2019-01-11 17:16:46,245] INFO [ReplicaManager broker=2] Stopping serving replicas in dir /kafka/logs (kafka.server.ReplicaManager)
[2019-01-11 17:16:46,314] INFO Stopping serving logs in dir /kafka/logs (kafka.log.LogManager)
[2019-01-11 17:16:46,326] ERROR Shutdown broker because all log dirs in /kafka/logs have failed (kafka.log.LogManager)

1 Answer


If you have not changed the log.retention.bytes, log.retention.hours, log.retention.minutes, or log.retention.ms settings, Kafka tries to delete logs after 7 days (the default log.retention.hours is 168). Based on the exception, Kafka wants to clean up the file /kafka/logs/__transaction_state-11/00000000000000807727.log, but that file no longer exists in the Kafka log directory. The resulting exception marks the log directory as failed, which shuts the broker down.
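To confirm the missing-file condition before changing anything, a small check like the following may help. It is only a sketch: the helper names segment_file_name and segment_status are made up for illustration, and it assumes the usual Kafka on-disk layout (20-digit zero-padded base offset, plus the .deleted, .cleaned and .swap suffixes the broker uses while replacing segments). It only reads the directory and changes nothing.

```python
from pathlib import Path

# Suffixes Kafka appends to a segment file while it is being replaced or removed.
# The empty string stands for the plain ".log" file itself.
SUFFIXES = ("", ".deleted", ".cleaned", ".swap")


def segment_file_name(base_offset: int) -> str:
    """Build the segment file name for a base offset (20 digits, zero-padded)."""
    return f"{base_offset:020d}.log"


def segment_status(partition_dir, base_offset):
    """Report which variants of one segment file exist in a partition directory.

    Hypothetical helper for illustration only; it performs read-only checks.
    """
    directory = Path(partition_dir)
    name = segment_file_name(base_offset)
    return {
        name + suffix: (directory / (name + suffix)).exists()
        for suffix in SUFFIXES
    }


if __name__ == "__main__":
    # Path and offset taken from the stack trace in the question.
    status = segment_status("/kafka/logs/__transaction_state-11", 807727)
    for file_name, present in status.items():
        print(f"{file_name}: {'present' if present else 'missing'}")
```

If every variant is reported missing, something other than the broker (or an earlier cleaner pass) has probably removed the file, which matches the NoSuchFileException in the log.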

If you are able to shut down the cluster and ZooKeeper, do so and clean up /kafka/logs/__transaction_state-11 manually.

Note: I don't know whether this is harmful, but you can follow the posts on safely removing a Kafka topic.
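One cautious way to do the manual cleanup is to move the partition directory aside instead of deleting it, so it can be restored if needed. The sketch below assumes the broker is already stopped. The function name set_aside_partition_dir and the backup location are hypothetical, and whether removing this directory is safe for your cluster is not something this sketch can verify.

```python
import shutil
import time
from pathlib import Path


def set_aside_partition_dir(partition_dir, backup_root, dry_run=True):
    """Move a partition directory under backup_root with a timestamp suffix.

    Hypothetical helper: run it only while the broker is stopped.
    With dry_run=True (the default) nothing is touched and the planned
    target path is returned.
    """
    source = Path(partition_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}.{stamp}"
    if dry_run:
        return target
    if not source.is_dir():
        raise FileNotFoundError(f"partition directory not found: {source}")
    # Create the backup root only once we know there is something to move.
    Path(backup_root).mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target


if __name__ == "__main__":
    # "/kafka/backup" is an assumed location, not something from the question.
    planned = set_aside_partition_dir(
        "/kafka/logs/__transaction_state-11", "/kafka/backup"
    )
    print(f"would move to: {planned}")
```

Calling it with dry_run=False performs the move. Keeping the default as a dry run makes it harder to move data by accident.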