In my Dataflow pipeline, I'll have two PCollection<TableRow>s that have been read from BigQuery tables. I plan to merge those two PCollections into one PCollection with a Flatten transform.
Since BigQuery is append-only, the goal is to write-truncate the second table in BigQuery with the new PCollection.
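To make the plan concrete, the merge-and-write part would look roughly like this (a sketch against the Dataflow Java SDK; `firstTable`, `secondTable`, `schema`, and the table reference are placeholders, and this fragment won't compile on its own without the SDK):

```java
// Sketch only: firstTable and secondTable are the two
// PCollection<TableRow>s already read from BigQuery.
PCollection<TableRow> merged = PCollectionList.of(firstTable).and(secondTable)
    .apply(Flatten.<TableRow>pCollections());

// ... the middle steps I'm asking about would go here ...

merged.apply(BigQueryIO.Write
    .named("WriteDeduped")
    .to("my-project:my_dataset.second_table")  // placeholder table reference
    .withSchema(schema)                        // same schema as the source tables
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
```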
I've read through the documentation, and it's the middle steps I'm confused about. With my new PCollection, the plan is to use a DoFn with a Comparator to find the max lastUpdateDate and return the corresponding row. I'm unsure whether I should be using a Filter transform, or whether I should do a GroupByKey and then filter.
All PCollection<TableRow>s will contain the same value types: a string, an integer, and a timestamp. When it comes to key-value pairs, most of the Cloud Dataflow documentation uses just simple strings. Is it possible to have a key-value pair whose value is an entire row of the PCollection<TableRow>?
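What I'm picturing is something like the following (a sketch against the Dataflow Java SDK, keying each row by the customerID field from the rows below; `merged` is the flattened PCollection, and this fragment won't compile on its own):

```java
// Sketch only: key each TableRow by customerID; the value is the whole row.
PCollection<KV<String, TableRow>> keyed = merged.apply(
    ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
      @Override
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        c.output(KV.of((String) row.get("customerID"), row));
      }
    }));
```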
The rows would look similar to:
customerID, customerName, lastUpdateDate
0001, customerOne, 2016-06-01 00:00:00
0001, customerOne, 2016-06-11 00:00:00
In the example above, I would want the filter to return just the second row, into a PCollection that would then be written to BigQuery. Also, is it possible to apply these ParDos to the third PCollection without creating a fourth?
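Setting the Dataflow specifics aside, the reduction I have in mind is this (a plain-Java sketch, no SDK; rows are modeled as `String[] {customerID, customerName, lastUpdateDate}`, and timestamps in this format compare correctly as strings):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group rows by customerID and keep only the row with the
// latest lastUpdateDate for each customer.
public class LatestRowPerCustomer {

    // Compare rows by their lastUpdateDate field (index 2).
    static final Comparator<String[]> BY_LAST_UPDATE =
        Comparator.comparing(r -> r[2]);

    public static List<String[]> latestPerCustomer(List<String[]> rows) {
        Map<String, String[]> latest = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0]; // customerID
            String[] current = latest.get(key);
            // Keep whichever row has the greater lastUpdateDate.
            if (current == null || BY_LAST_UPDATE.compare(row, current) > 0) {
                latest.put(key, row);
            }
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"0001", "customerOne", "2016-06-01 00:00:00"});
        rows.add(new String[] {"0001", "customerOne", "2016-06-11 00:00:00"});
        List<String[]> result = latestPerCustomer(rows);
        // Only the newer of the two rows survives.
        System.out.println(result.size() + " row(s), latest: " + result.get(0)[2]);
    }
}
```

With the two sample rows above, only the 2016-06-11 row survives, which is the behavior I want before writing back to BigQuery.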