We're running an Apache Flink 1.3.2 cluster that consumes Kafka messages. Since upgrading the broker from 0.10.2 to 1.1.0, we frequently see this error in the log:
ERROR o.a.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - Async Kafka commit failed.
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing offsets.
Caused by: org.apache.kafka.common.errors.DisconnectException: null
Because of this, we sometimes see missing events during processing. The job uses FlinkKafkaConsumer010.
Checkpointing is enabled (interval 10 s, timeout 1 minute, minimum pause between checkpoints 5 s, maximum concurrent checkpoints 1; end-to-end checkpoint duration averages under 1 s, usually even under half a second). We used the same settings with Kafka 0.10.2 and never saw this exception there.
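For reference, the checkpoint settings above correspond roughly to the following Flink job setup (a sketch only; the topic name, consumer group, and bootstrap servers are placeholders, not our actual values):

```java
import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class JobSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing settings as described above
        env.enableCheckpointing(10_000);                                  // interval: 10 s
        env.getCheckpointConfig().setCheckpointTimeout(60_000);           // timeout: 1 min
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);   // min pause: 5 s
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);         // max concurrent: 1

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");  // placeholder
        props.setProperty("group.id", "my-consumer-group");     // placeholder

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);
        // With checkpointing enabled, offsets are committed back to Kafka
        // on checkpoint completion (the commit that fails in the error above).
        consumer.setCommitOffsetsOnCheckpoints(true);

        env.addSource(consumer).print();
    }
}
```

Note that with checkpointing enabled, Flink commits offsets to Kafka only as a side effect of completed checkpoints; exactly-once state relies on the checkpointed offsets, not on the Kafka commits that are failing here.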
Update: We have reinstalled Kafka, and now we get a warning message instead, but still no events are read:
WARN o.a.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity.