I currently have a streaming pipeline that processes events and pushes them to a BigQuery table named EventsTable:
    TransactionID  EventType
    1              typeA
    1              typeB
    1              typeB
    1              typeC
    2              typeA
    2              typeC
    3              typeA
I want to add a branch to my processing pipeline and also "group" the transaction-related data together into a TransactionsTable. Roughly, the value in each type column of the TransactionsTable would be the count of the corresponding eventType for a given transaction. With the previous example events, the output would look like this:
    TransactionID  typeA  typeB  typeC
    1              1      2      1
    2              1      0      1
    3              1      0      0
The number of "type" columns would be equal to the number of different eventType that exists within the system.
I'm trying to see how I could do this with Dataflow, but I cannot find any clean way to do it. I know that PCollections are immutable, so I cannot store the incoming data in an ever-growing PCollection that would queue the incoming events until the related elements have arrived and the group can be written to the second BigQuery table. Is there a sort of windowing function that would allow me to do this with Dataflow (something like queuing the events in a temporary windowed structure with a sort of expiry date)?
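The closest thing I've been able to sketch is session windows keyed by TransactionID, which close once no new events arrive for that transaction within some gap. This is only a rough sketch in the Beam Python SDK; the Pub/Sub topic, the parse_event helper, the 10-minute gap, and the table spec are all placeholders, and to_transaction_row is the helper from above:

    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline(options=pipeline_options) as p:
        _ = (
            p
            | 'ReadEvents' >> beam.io.ReadFromPubSub(
                topic='projects/my-project/topics/events')
            # parse_event is a placeholder that turns the raw message into
            # a dict like {'TransactionID': ..., 'EventType': ...}
            | 'Parse' >> beam.Map(parse_event)
            | 'KeyByTransaction' >> beam.Map(
                lambda e: (e['TransactionID'], e['EventType']))
            # Session windows per key: a transaction's "queue" of events
            # closes after 10 minutes without new events for that key.
            | 'SessionWindow' >> beam.WindowInto(window.Sessions(gap_size=10 * 60))
            | 'GroupEvents' >> beam.GroupByKey()
            | 'ToRow' >> beam.Map(to_transaction_row)
            # Schema and write/create dispositions omitted for brevity.
            | 'WriteTransactions' >> beam.io.WriteToBigQuery(
                'my-project:my_dataset.TransactionsTable')
        )

I'm not sure whether session windows are the right kind of "expiry" here, though.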
I could probably do something with batch jobs and Pub/Sub, but that would be far more complex. On the other hand, I do understand that Dataflow is not meant to hold ever-growing data structures, and that data, once it goes in, has to go through the pipeline and exit (or be discarded). Am I missing something?