2 votes

I am new to Flink. My understanding is that the following API call

StreamExecutionEnvironment.getExecutionEnvironment().readFile(format, path)

will read the files in parallel for a given S3 bucket path.

We store log files in S3. The requirement is to serve multiple client requests, each reading from a different timestamped folder.

To serve these multiple client requests, I am evaluating Flink. I want Flink to read from different AWS S3 file paths in parallel.

Is it possible to achieve this in a single Flink job? Any suggestions?


1 Answer

2 votes

Documentation about Flink's S3 file system support can be found in the official Flink documentation.

You can read from different directories and use the union() operator to combine the records from all of them into one stream.

It is also possible to read files in nested directories recursively by using something like this (untested):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.configuration.Configuration;

TextInputFormat format = new TextInputFormat(path);
Configuration config = new Configuration();
// Tell the input format to enumerate files in nested directories.
config.setBoolean("recursive.file.enumeration", true);
format.configure(config);
env.readFile(format, path);