In my Dataflow pipeline, I'll have two PCollection<TableRow>s that have been read from BigQuery tables. I plan to merge those two PCollections into one PCollection with a Flatten transform.
Since BigQuery is append-only, the goal is to write-truncate the second table in BigQuery with the new PCollection.
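To make the plan concrete, the merge-and-write part would look roughly like this (a sketch against the Dataflow Java SDK; `firstTable`, `secondTable`, `schema`, and the table reference are placeholders, and this fragment won't compile on its own without the SDK):

```java
// Sketch only: firstTable and secondTable are the two
// PCollection<TableRow>s already read from BigQuery.
PCollection<TableRow> merged = PCollectionList.of(firstTable).and(secondTable)
    .apply(Flatten.<TableRow>pCollections());

// ... the middle steps I'm asking about would go here ...

merged.apply(BigQueryIO.Write
    .named("WriteDeduped")
    .to("my-project:my_dataset.second_table")  // placeholder table reference
    .withSchema(schema)                        // same schema as the source tables
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
```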
I've read through the documentation, and it's the middle steps I'm confused about. With my new PCollection, the plan is to use a DoFn with a Comparator to find the max lastUpdateDate and return the corresponding row. I'm unsure whether I should be using a Filter transform, or whether I should do a GroupByKey and then filter.
All PCollection<TableRow>s will contain the same value types: a string, an integer, and a timestamp. When it comes to key-value pairs, most of the Cloud Dataflow documentation uses just simple strings. Is it possible to have a key-value pair whose value is an entire row of the PCollection<TableRow>?
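What I'm picturing is something like the following (a sketch against the Dataflow Java SDK, keying each row by the customerID field from the rows below; `merged` is the flattened PCollection, and this fragment won't compile on its own):

```java
// Sketch only: key each TableRow by customerID; the value is the whole row.
PCollection<KV<String, TableRow>> keyed = merged.apply(
    ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
      @Override
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        c.output(KV.of((String) row.get("customerID"), row));
      }
    }));
```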
The rows would look similar to:
customerID, customerName, lastUpdateDate
0001, customerOne, 2016-06-01 00:00:00
0001, customerOne, 2016-06-11 00:00:00
In the example above, I would want the filter to return just the second row, into a PCollection that would then be written to BigQuery. Also, is it possible to apply these ParDos to the third PCollection without creating a fourth?
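Setting the Dataflow specifics aside, the reduction I have in mind is this (a plain-Java sketch, no SDK; rows are modeled as `String[] {customerID, customerName, lastUpdateDate}`, and timestamps in this format compare correctly as strings):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group rows by customerID and keep only the row with the
// latest lastUpdateDate for each customer.
public class LatestRowPerCustomer {

    // Compare rows by their lastUpdateDate field (index 2).
    static final Comparator<String[]> BY_LAST_UPDATE =
        Comparator.comparing(r -> r[2]);

    public static List<String[]> latestPerCustomer(List<String[]> rows) {
        Map<String, String[]> latest = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0]; // customerID
            String[] current = latest.get(key);
            // Keep whichever row has the greater lastUpdateDate.
            if (current == null || BY_LAST_UPDATE.compare(row, current) > 0) {
                latest.put(key, row);
            }
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"0001", "customerOne", "2016-06-01 00:00:00"});
        rows.add(new String[] {"0001", "customerOne", "2016-06-11 00:00:00"});
        List<String[]> result = latestPerCustomer(rows);
        // Only the newer of the two rows survives.
        System.out.println(result.size() + " row(s), latest: " + result.get(0)[2]);
    }
}
```

With the two sample rows above, only the 2016-06-11 row survives, which is the behavior I want before writing back to BigQuery.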