Apache Flink with S3 as source and S3 as sink

Question

Is it possible to read events as they land in an S3 source bucket with Apache Flink, process them, and sink the results to another S3 bucket? Is there a special connector for that, or do I have to use the available read/save examples for Apache Flink? How does checkpointing work in that case: does Flink automatically keep track of what it has read from the S3 source bucket, or does that need custom code? Does Flink also guarantee exactly-once processing with an S3 source?

David Anderson · Accepted Answer · 2020-06-28T09:01:43

In Flink 1.11 the FileSystem SQL Connector is much improved; that will be an excellent solution for this use case.

With the DataStream API you can use FileProcessingMode.PROCESS_CONTINUOUSLY with readFile to monitor a bucket and ingest new files as they are atomically moved into it. Flink keeps track of the last-modified timestamp of the bucket and ingests any children modified since that timestamp. It does so in an exactly-once way, because the read offsets into those files are included in checkpoints.
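To make the timestamp tracking described above more concrete, here is a minimal sketch in plain Java. It is not Flink's implementation and uses no Flink classes: the class ModTimeMonitor and its methods are hypothetical names invented for illustration, and it scans a local directory rather than an S3 bucket. It only shows the idea of remembering the highest modification time seen so far, returning children newer than that, and exposing that value so it could be saved and restored the way checkpointed state is.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is NOT Flink's code or API.
class ModTimeMonitor {
    // Highest modification time (epoch millis) of any file already returned.
    // This is the piece of state that would need to survive a restart.
    private long lastSeenModTime;

    // Start from a previously saved value (use 0 to treat every file as new).
    ModTimeMonitor(long restoredModTime) {
        this.lastSeenModTime = restoredModTime;
    }

    // Value to persist so a restarted monitor does not return old files again.
    long snapshotState() {
        return lastSeenModTime;
    }

    // Returns the regular files in dir modified after the remembered timestamp,
    // then advances the timestamp to the newest modification time observed.
    List<Path> pollNewFiles(Path dir) throws IOException {
        List<Path> fresh = new ArrayList<>();
        long maxSeen = lastSeenModTime;
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (!Files.isRegularFile(child)) {
                    continue; // skip subdirectories and other non-file entries
                }
                long modTime = Files.getLastModifiedTime(child).toMillis();
                if (modTime > lastSeenModTime) {
                    fresh.add(child);
                    maxSeen = Math.max(maxSeen, modTime);
                }
            }
        }
        lastSeenModTime = maxSeen;
        Collections.sort(fresh); // deterministic order for the caller
        return fresh;
    }
}
```

In this sketch, a file that shows up with a modification time at or below the remembered timestamp is never returned, and a file that becomes visible before it is fully written would be picked up half-finished. That is one way to see why the answer stresses moving complete files into the bucket atomically. In a real job the monitoring is done by readFile with FileProcessingMode.PROCESS_CONTINUOUSLY, and the state is handled by Flink's checkpoints rather than by code like this.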
