2
votes

I was trying to read data using a wildcard from a GCS path. My files are in bzip2 format, and around 300k files reside under the GCS path matching the same wildcard expression. I'm using the code snippet below to read the files.

    PCollection<String> val = p
            .apply(FileIO.match()
                    .filepattern("gcsPath"))
            .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
            .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
                try {
                    // Read the whole decompressed file as a single String
                    return f.readFullyAsUTF8String();
                } catch (IOException e) {
                    return null;
                }
            }));

But the performance is very bad: at the current speed it will take around 3 days to read all the files with the above code. Is there any alternative API I can use in Cloud Dataflow to read this number of files from GCS with, of course, good performance? I used TextIO earlier, but it failed because of the 20 MB template serialization limit.

1
What's the total transfer size of 300k files? - Parth Mehta
Is it running on Dataflow or on your computer? - guillaume blaquiere
@ParthMehta Total transfer size is around 1 TB. - miles212
After fixing the template size error as per chamikara's comment, if you use TextIO you can also make use of beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/… - Reza Rokni
@RezaRokni I just saw your answer, but I got the same solution yesterday. Thanks for your help. - miles212

1 Answer

0
votes

The TextIO code below solved the issue.

    PCollection<String> input = p.apply("Read file from GCS",
            TextIO.read().from(options.getInputFile())
                    .withCompression(Compression.AUTO)
                    .withHintMatchesManyFiles());

withHintMatchesManyFiles() solved the issue. But I still don't know why the FileIO performance is so bad.
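One possible factor (an assumption, not confirmed by the question's job details): Dataflow fuses consecutive steps into a single stage, so the match and read steps of the FileIO pipeline can end up running on only a few workers, serializing the reads of all 300k files. A commonly suggested workaround is to insert a `Reshuffle` between matching and reading to break fusion and redistribute the matched files. A minimal sketch, reusing the `"gcsPath"` placeholder from the question:

```java
import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileIO.ReadableFile;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReadManyBzip2Files {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<String> val = p
                .apply(FileIO.match().filepattern("gcsPath"))
                .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
                // Break fusion here: redistribute the matched files across
                // workers so the expensive read step can parallelize.
                .apply(Reshuffle.viaRandomKey())
                .apply(MapElements.into(TypeDescriptors.strings()).via((ReadableFile f) -> {
                    try {
                        return f.readFullyAsUTF8String();
                    } catch (IOException e) {
                        return null;
                    }
                }));
        p.run();
    }
}
```

This does not explain the slowness definitively, but it is a cheap experiment: if the FileIO version speeds up dramatically with the reshuffle, fusion was the bottleneck, which is also roughly what `withHintMatchesManyFiles()` addresses for TextIO.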