I'm implementing an ETL pipeline in Python with Apache Beam, running on Google Cloud Dataflow. The pipeline itself is simple:
1. Read data from BigQuery using BigQuerySource
2. Transform data
2.1. Perform basic row-level transforms
2.2. Calculate various parameters based on historical data
3. Write data into another BigQuery table using BigQuerySink
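For reference, here is a minimal sketch of the pipeline as it stands. The table names, field names, query, and `transform_row` logic are all placeholders, not my actual code; the Beam import is deferred so the row-level transform can be exercised on its own:

```python
def transform_row(row):
    """Step 2.1: a hypothetical row-level transform.

    Placeholder logic: normalize one field to float.
    """
    row = dict(row)
    row["value"] = float(row.get("value", 0.0))
    return row


def run(argv=None):
    # Imported here so transform_row above is usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline(argv=argv) as p:
        (
            p
            # Step 1: read from BigQuery (placeholder query).
            | "Read" >> beam.io.Read(
                beam.io.BigQuerySource(
                    query="SELECT id, value FROM dataset.source_table"))
            # Step 2.1: basic row-level transforms.
            | "Transform" >> beam.Map(transform_row)
            # Step 2.2 would go here -- this is the part I'm asking about.
            # Step 3: write to another BigQuery table (placeholder schema).
            | "Write" >> beam.io.Write(
                beam.io.BigQuerySink(
                    "dataset.target_table",
                    schema="id:INTEGER,value:FLOAT"))
        )
```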
The problem is step 2.2: to calculate the parameters, I need not only the current row but also the previous X (typically 100-500) rows from the same table.
Of course, I could simply run another query per row to load the historical data, but that would be very inefficient. What is the most efficient and straightforward way to implement this?
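To make the requirement concrete, here is a plain-Python sketch of what step 2.2 would do if the data were a single ordered sequence (the `value` field and the rolling mean are placeholder examples of "various parameters"). The difficulty is that this is inherently sequential and order-dependent, which doesn't map naturally onto Beam's parallel processing model:

```python
from collections import deque


def rolling_params(rows, window=100, key="value"):
    """For each row, compute a parameter (here: a rolling mean) over the
    previous `window` rows. Assumes `rows` is already in the right order.

    Rows with no preceding history get None for the parameter.
    """
    history = deque(maxlen=window)  # sliding buffer of previous values
    out = []
    for row in rows:
        mean = sum(history) / len(history) if history else None
        out.append(dict(row, rolling_mean=mean))
        history.append(row[key])  # current row becomes history for the next
    return out
```

This is exactly what I want to avoid re-querying BigQuery for on every element.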