We want to write compressed data to HDFS by Flink's BucketingSink or StreamingFileSink. I have write my own Writer which works fine if no failure occurs. However when It encounters a failure and restart from checkpoint, It will generate valid-length file(hadoop < 2.7) or truncate the file. Unluckily gzips are binary files which have trailer at the end of file. Therefore simple truncation does not work in my case. Any ideas to enable exactly-once semantic for compression hdfs sink?
That's my writer's code:
public class HdfsCompressStringWriter extends StreamWriterBaseV2<JSONObject> {
private static final long serialVersionUID = 2L;
/**
* The {@code CompressFSDataOutputStream} for the current part file.
*/
private transient GZIPOutputStream compressionOutputStream;
public HdfsCompressStringWriter() {}
@Override
public void open(FileSystem fs, Path path) throws IOException {
super.open(fs, path);
this.setSyncOnFlush(true);
compressionOutputStream = new GZIPOutputStream(this.getStream(), true);
}
public void close() throws IOException {
if (compressionOutputStream != null) {
compressionOutputStream.close();
compressionOutputStream = null;
}
resetStream();
}
@Override
public void write(JSONObject element) throws IOException {
if (element == null || !element.containsKey("body")) {
return;
}
String content = element.getString("body") + "\n";
compressionOutputStream.write(content.getBytes());
compressionOutputStream.flush();
}
@Override
public Writer<JSONObject> duplicate() {
return new HdfsCompressStringWriter();
}
}