2
votes

I was trying to read data using a wildcard from a GCS path. My files are in bzip2 format, and around 300k files reside under the GCS path matching the same wildcard expression. I'm using the code snippet below to read the files.

    PCollection<String> val = p
            .apply(FileIO.match()
                    .filepattern("gcsPath"))
            .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
            .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
                try {
                    // Read the whole decompressed file as a single String
                    return f.readFullyAsUTF8String();
                } catch (IOException e) {
                    return null;
                }
            }));

But the performance is very bad: at the current speed it will take around 3 days to read all the files with the above code. Is there any alternative API I can use in Cloud Dataflow to read this number of files from GCS with, of course, good performance? I used TextIO earlier, but it failed because of the 20 MB template serialization limit.

1
What's the total transfer size of 300k files? - Parth Mehta
Is it running on Dataflow or on your computer? - guillaume blaquiere
@ParthMehta Total transfer size is around 1 TB. - miles212
After fixing the template size error as per chamikara's comment, if you use TextIO you can also make use of beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/… - Reza Rokni
@RezaRokni I just saw your answer, but I got the same solution yesterday. Thanks for your help. - miles212

1 Answer

0
votes

The TextIO code below solved the issue.

    PCollection<String> input = p.apply("Read file from GCS",
            TextIO.read().from(options.getInputFile())
                    .withCompression(Compression.AUTO)
                    .withHintMatchesManyFiles());

withHintMatchesManyFiles() solved the issue. But I still don't know why the FileIO performance is so bad.
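One possible factor (an assumption, not confirmed by the question's job details): Dataflow fuses consecutive steps into a single stage, so the match and read steps of the FileIO pipeline can end up running on only a few workers, serializing the reads of all 300k files. A commonly suggested workaround is to insert a `Reshuffle` between matching and reading to break fusion and redistribute the matched files. A minimal sketch, reusing the `"gcsPath"` placeholder from the question:

```java
import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileIO.ReadableFile;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReadManyBzip2Files {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<String> val = p
                .apply(FileIO.match().filepattern("gcsPath"))
                .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
                // Break fusion here: redistribute the matched files across
                // workers so the expensive read step can parallelize.
                .apply(Reshuffle.viaRandomKey())
                .apply(MapElements.into(TypeDescriptors.strings()).via((ReadableFile f) -> {
                    try {
                        return f.readFullyAsUTF8String();
                    } catch (IOException e) {
                        return null;
                    }
                }));
        p.run();
    }
}
```

This does not explain the slowness definitively, but it is a cheap experiment: if the FileIO version speeds up dramatically with the reshuffle, fusion was the bottleneck, which is also roughly what `withHintMatchesManyFiles()` addresses for TextIO.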