4
votes

I tried using the following:

TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")

That pattern didn't work; I get:

java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}

Both files do exist, though. I also tried a similar expression against a local file:

TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")

And that did work just fine.

I would've thought there would be support for all kinds of globs for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm looking for?

3
Docs say it supports "Standard Java Filesystem globbing", but it appears to only support these: * matches everything, ? matches any single character, [seq] matches any character in seq, [!seq] matches any character not in seq. - Davos

3 Answers

10
votes

This may be another option, in addition to Scott's suggestion and your comment on his answer:

You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:

PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));

Then create a PCollectionList:

PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);

And then flatten this list into your PCollection for your main input:

PCollection<String> events = eventsList.apply(Flatten.pCollections());
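If you would rather keep the brace-style pattern in your configuration, one option is to expand it into explicit paths yourself before building the list. Below is a minimal sketch of that idea using only the JDK. The expandBraces helper and the BraceExpander class are hypothetical names (nothing from Beam), it handles only a single, non-nested {a,b} group, and the second date in the example is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BraceExpander {
    /**
     * Expands the first {a,b,c} group in a pattern into one string per alternative.
     * Hypothetical helper: handles a single, non-nested brace group only.
     */
    public static List<String> expandBraces(String pattern) {
        List<String> paths = new ArrayList<>();
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}', open + 1);
        if (open < 0 || close < 0) {
            // No complete brace group: treat the pattern as a single path.
            paths.add(pattern);
            return paths;
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        // One output path per comma-separated alternative inside the braces.
        for (String alternative : pattern.substring(open + 1, close).split(",")) {
            paths.add(prefix + alternative + suffix);
        }
        return paths;
    }

    public static void main(String[] args) {
        // The second date is an assumed example value.
        for (String path : expandBraces("gs://xyz.abc/xxx_{2017-06-06,2017-06-07}.csv")) {
            System.out.println(path);
        }
    }
}
```

Each resulting path can then be passed to TextIO.Read.from and the resulting PCollections combined with a PCollectionList and Flatten, as described above.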

3
votes

Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.

GCS glob wildcard patterns are documented here (Wildcard Names).

In the case above, you could use:

TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")

Note however that this will also include any other matching files.
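To get a feel for how much extra a wildcard pulls in, you can try candidate patterns against sample file names locally. The sketch below uses java.nio's PathMatcher, which follows Java's glob syntax and is only an approximation of how GCS interprets patterns. The GlobCheck class, the narrower xxx_2017-06-0[67].csv pattern, and the 06-07 and 06-15 file names are assumptions for illustration; the character class relies on [seq] being supported on GCS, which is worth confirming against your bucket.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobCheck {
    /** Returns true if fileName matches the glob under java.nio's glob syntax. */
    public static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // Sample names; only the 06-06 file comes from the question.
        String[] files = {"xxx_2017-06-06.csv", "xxx_2017-06-07.csv", "xxx_2017-06-15.csv"};
        for (String file : files) {
            System.out.println(file
                + " wildcard=" + matches("xxx_2017-06-*.csv", file)
                + " charClass=" + matches("xxx_2017-06-0[67].csv", file));
        }
    }
}
```

The wildcard pattern accepts every file for the month, while the character class restricts the match to the two days listed in the brackets.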

-4
votes

Did you try the from function of Apache Beam's TextIO.Read? Its documentation says it works with GCS as well:

public TextIO.Read from(java.lang.String filepattern)

Reads text files that reads from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using remote execution service).

Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.