1 vote

Is it possible to write to a 2nd BigQuery table after the write to the 1st has finished in a batch pipeline, using the Wait.on() method (a new feature in Apache Beam 2.4)? The example given in the Apache Beam documentation is:

 PCollection<Void> firstWriteResults = data.apply(ParDo.of(...write to first database...));
 data.apply(Wait.on(firstWriteResults))
     // Windows of this intermediate PCollection will be processed no earlier than when
     // the respective window of firstWriteResults closes.
     .apply(ParDo.of(...write to second database...));

But why would I write to a database from within a ParDo? Can we not achieve the same thing using the I/O transforms provided in Dataflow?

Thanks.

Encountering a similar issue - it seems Wait.on is meant to be used on a PCollection, but database sinks result in a PDone. Is there some way to use Wait.on() (or an equivalent) on either a PDone or on the Pipeline itself? - gilmatic
@gilmatic Right. But then what exactly do they mean by "write to database"? Do I need to make an API call?! I don't think that's a good idea... - rish0097

1 Answer

2 votes

Yes, this is possible, although there are some known limitations, and work is currently being done to support this further.

To make this work, you can do something like the following:

WriteResult writeResult = data.apply(BigQueryIO.write()
     ...
     .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS) 
);

data.apply(Wait.on(writeResult.getFailedInserts()))
    .apply(...some transform which writes to second database...);

It should be noted that this only works with streaming inserts and won't work with file loads. At the same time, there is work currently being done to better support this use case, which you can follow here
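
For reference, here is a minimal end-to-end sketch of the pattern, assuming data is a PCollection<TableRow>, the destination tables already exist, and the project/dataset/table names are placeholders:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
    import org.apache.beam.sdk.transforms.Wait;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<TableRow> data = ...; // rows to write (placeholder)

    // First write. STREAMING_INSERTS is what makes getFailedInserts()
    // available on the returned WriteResult, so there is a PCollection
    // (rather than a PDone) to wait on.
    WriteResult firstWrite = data.apply("WriteFirstTable",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.first_table")      // placeholder
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    // The second write starts only once the corresponding window of the
    // first write's failed-inserts collection has closed.
    data.apply(Wait.on(firstWrite.getFailedInserts()))
        .apply("WriteSecondTable",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.second_table") // placeholder
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

This also answers the question in the comments: Wait.on() needs a PCollection as its signal, so instead of the PDone produced by most sinks, you wait on the failed-inserts collection that BigQueryIO exposes via WriteResult.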
