I'm implementing an ETL pipeline in Python with Apache Beam, running on Google Cloud Dataflow. The pipeline itself is simple:
1. Read data from BigQuery using BigQuerySource
2. Transform data
2.1. Perform basic row-level transforms
2.2. Calculate various parameters based on historical data
3. Write data into another BigQuery table using BigQuerySink
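For reference, here is a minimal sketch of the pipeline as it stands. The table names, field names, query, and `transform_row` logic are all placeholders, not my actual code; the Beam import is deferred so the row-level transform can be exercised on its own:

```python
def transform_row(row):
    """Step 2.1: a hypothetical row-level transform.

    Placeholder logic: normalize one field to float.
    """
    row = dict(row)
    row["value"] = float(row.get("value", 0.0))
    return row


def run(argv=None):
    # Imported here so transform_row above is usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline(argv=argv) as p:
        (
            p
            # Step 1: read from BigQuery (placeholder query).
            | "Read" >> beam.io.Read(
                beam.io.BigQuerySource(
                    query="SELECT id, value FROM dataset.source_table"))
            # Step 2.1: basic row-level transforms.
            | "Transform" >> beam.Map(transform_row)
            # Step 2.2 would go here -- this is the part I'm asking about.
            # Step 3: write to another BigQuery table (placeholder schema).
            | "Write" >> beam.io.Write(
                beam.io.BigQuerySink(
                    "dataset.target_table",
                    schema="id:INTEGER,value:FLOAT"))
        )
```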
The problem is step 2.2: to calculate the parameters, I need not only the current row but also the previous X (typically 100-500) rows from the same table.
Of course, I could simply run another query per row to load the historical data, but that would be very inefficient. What is the most efficient and straightforward way to implement this?
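To make the requirement concrete, here is a plain-Python sketch of what step 2.2 would do if the data were a single ordered sequence (the `value` field and the rolling mean are placeholder examples of "various parameters"). The difficulty is that this is inherently sequential and order-dependent, which doesn't map naturally onto Beam's parallel processing model:

```python
from collections import deque


def rolling_params(rows, window=100, key="value"):
    """For each row, compute a parameter (here: a rolling mean) over the
    previous `window` rows. Assumes `rows` is already in the right order.

    Rows with no preceding history get None for the parameter.
    """
    history = deque(maxlen=window)  # sliding buffer of previous values
    out = []
    for row in rows:
        mean = sum(history) / len(history) if history else None
        out.append(dict(row, rolling_mean=mean))
        history.append(row[key])  # current row becomes history for the next
    return out
```

This is exactly what I want to avoid re-querying BigQuery for on every element.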