0
votes

I have an Apache Beam pipeline that processes multiple time series in real time. Deployed on GCP Dataflow, it combines the time series into windows and calculates aggregates and similar statistics. I now need to perform the same operations over historic data (the same set of time series) stretching all the way back to 2017. How can I achieve this using Apache Beam?

I understand that I need to use Apache Beam's windowing to calculate the aggregates, but the pipeline should accept data from 2 years back onwards.

Effectively, I need the data that would have been available had I deployed the same pipeline 2 years ago. This is needed for testing and model training purposes.

1 Answer

0
votes

That sounds like a perfect use case for Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as the events have timestamps. Without additional context, I think you will need an explicit step in your pipeline that extracts the original timestamps (from 2017 onwards) from the data and assigns them to the elements. To do this you can probably use either:

  • context.outputWithTimestamp() in your DoFn;
  • the WithTimestamps PTransform.

You might also need to configure the allowed timestamp skew if you run into timestamp ordering issues.
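To illustrate why event-time processing makes a historic replay give the same results as a live run, here is a minimal plain-Java sketch with no Beam dependency. It only mimics the idea behind assigning each element its own timestamp and then grouping into fixed windows. The Reading class, the field names, and the one-minute window size are assumptions made up for this example, not part of your pipeline or of the Beam API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTimeReplay {

    // Hypothetical element type: one time-series point that carries its own event time.
    static final class Reading {
        final String series;
        final Instant eventTime;
        final double value;

        Reading(String series, Instant eventTime, double value) {
            this.series = series;
            this.eventTime = eventTime;
            this.value = value;
        }
    }

    // Start of the fixed window (aligned to the epoch) that contains eventTime.
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ts = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    // Sums values per window across all series, keyed by window start.
    // The window is chosen from the timestamp stored in the data,
    // never from the wall-clock time at which the record is read.
    static Map<Instant, Double> sumPerWindow(List<Reading> readings, Duration size) {
        Map<Instant, Double> sums = new TreeMap<>();
        for (Reading r : readings) {
            sums.merge(windowStart(r.eventTime, size), r.value, Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Historic records, deliberately listed out of timestamp order.
        List<Reading> historic = List.of(
            new Reading("a", Instant.parse("2017-01-01T00:01:05Z"), 4.0),
            new Reading("b", Instant.parse("2017-01-01T00:00:50Z"), 2.0),
            new Reading("a", Instant.parse("2017-01-01T00:00:10Z"), 1.0));

        sumPerWindow(historic, Duration.ofMinutes(1))
            .forEach((start, sum) -> System.out.println(start + " -> " + sum));
    }
}
```

Because the window is derived from the timestamp inside each record, the two readings from 00:00 end up in the same window regardless of when or in what order they are read. In a real Beam pipeline the timestamp-assignment step described above plays the role of windowStart's input, and your existing windowing and aggregation transforms do the rest.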

See: