
For example, I have a list of URLs stored as strings in Datastore. I used DatastoreIO to read them into a PCollection. In a ParDo's DoFn, each URL is a Google Cloud Storage location of a file; for each URL, I have to read the file at that location and apply further transformations.

So I want to know whether I can apply a ParDo to a PCollection from inside another ParDo's DoFn — in effect, transforming each file in parallel and emitting something like KV(key, PCollection) as the output of the first ParDo.

Sorry if I haven't presented my scenario clearly; I'm new to Apache Beam and Google Dataflow.


1 Answer


What you want is TextIO#readAll().

PCollection<String> urls = pipeline.apply(DatastoreIO.read(...));
// Read the contents of every file whose GCS path appears in `urls`.
PCollection<String> lines = urls.apply(TextIO.readAll());