
I am using Flink's BucketingSink to write from Kafka to HDFS. The Flink version is 1.4.2.

I found that there is some data loss every time I restart the job, even when restoring from a savepoint.

I found that this problem goes away if I configure the writer with SequenceFile.CompressionType.RECORD instead of SequenceFile.CompressionType.BLOCK. It seems that when Flink saves a checkpoint, the valid length it records differs from the actual file length, which should include the compressed data.
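For reference, here is roughly how my sink is configured; the HDFS path, key/value types and codec name below are placeholders rather than my exact setup:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.fs.SequenceFileWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// 'stream' is the DataStream<Tuple2<LongWritable, Text>> coming from the Kafka source.
BucketingSink<Tuple2<LongWritable, Text>> sink =
        new BucketingSink<>("hdfs:///tmp/flink-output");          // placeholder path
sink.setBucketer(new DateTimeBucketer<Tuple2<LongWritable, Text>>("yyyy-MM-dd--HH"));

// With BLOCK compression I lose data on restart; switching to RECORD avoids it.
sink.setWriter(new SequenceFileWriter<LongWritable, Text>(
        "Default",                                    // placeholder codec name
        SequenceFile.CompressionType.RECORD));        // BLOCK is what I actually need

stream.addSink(sink);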

However, it is a problem for us if we cannot use CompressionType.BLOCK, because of the disk usage. How can I prevent data loss with block compression when restarting the job?

Is this a known issue in Flink? Or does anybody know how to solve this problem?


1 Answer


Flink's BucketingSink is no longer recommended. Instead, the community recommends using the StreamingFileSink, which was introduced in Flink 1.6.0.
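A minimal sketch of what that could look like on Flink 1.6+; the output path, element type, and encoder here are assumptions, not taken from your job:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// 'stream' is an assumed DataStream<String> produced by your Kafka source.

// Row-format sink writing plain strings; part files are only finalized when a
// checkpoint completes, so a restore does not leave truncated data behind.
StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("hdfs:///tmp/flink-output"),       // placeholder path
                      new SimpleStringEncoder<String>("UTF-8"))
        .build();

stream.addSink(sink);

Note that the StreamingFileSink requires checkpointing to be enabled; in-progress part files are only moved to the finished state when a checkpoint completes.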