
I am using Flink's BucketingSink to write from Kafka to HDFS. The Flink version is 1.4.2.

I found that there is some data loss every time I restart the job, even when restoring from a savepoint.

I found that this problem goes away if I configure the writer with SequenceFile.CompressionType.RECORD instead of SequenceFile.CompressionType.BLOCK. It seems that when Flink saves a checkpoint, the valid length it records differs from the actual file length, which should include the compressed data.
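For reference, here is roughly how my sink is configured; the HDFS path, key/value types and codec name below are placeholders rather than my exact setup:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.fs.SequenceFileWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// 'stream' is the DataStream<Tuple2<LongWritable, Text>> coming from the Kafka source.
BucketingSink<Tuple2<LongWritable, Text>> sink =
        new BucketingSink<>("hdfs:///tmp/flink-output");          // placeholder path
sink.setBucketer(new DateTimeBucketer<Tuple2<LongWritable, Text>>("yyyy-MM-dd--HH"));

// With BLOCK compression I lose data on restart; switching to RECORD avoids it.
sink.setWriter(new SequenceFileWriter<LongWritable, Text>(
        "Default",                                    // placeholder codec name
        SequenceFile.CompressionType.RECORD));        // BLOCK is what I actually need

stream.addSink(sink);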

However, it is a problem for us if we cannot use CompressionType.BLOCK, because of the disk usage. How can I prevent data loss with block compression when restarting the job?

Is this a known issue in Flink? Or does anybody know how to solve this problem?


1 Answer


Flink's BucketingSink is no longer recommended. Instead, the community recommends using the StreamingFileSink, which was introduced in Flink 1.6.0.
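A minimal sketch of what that could look like on Flink 1.6+; the output path, element type, and encoder here are assumptions, not taken from your job:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// 'stream' is an assumed DataStream<String> produced by your Kafka source.

// Row-format sink writing plain strings; part files are only finalized when a
// checkpoint completes, so a restore does not leave truncated data behind.
StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("hdfs:///tmp/flink-output"),       // placeholder path
                      new SimpleStringEncoder<String>("UTF-8"))
        .build();

stream.addSink(sink);

Note that the StreamingFileSink requires checkpointing to be enabled; in-progress part files are only moved to the finished state when a checkpoint completes.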