0
votes

My Dataflow pipeline collates event data into typed per session and per user PCollections output. I have employed GroupByKey for events keyed by session id. Sessions are grouped into parent types keyed by user id and device id using the same pattern at this next level of hierarchy. So a single session might generate many events, but in turn a single user might generate many sessions.

I would now like to summarize this data across each level of the hierarchy. I have used a StateSpec declaration to persist state at the event level. So for example, an event count property can be incremented in my event processing ParDo. (Use Case : generating an error event per session across all users for example.)

But as each ParDo is static - I cannot access the ValueState outside of the ParDo context even though my understanding is this state is maintained at the Window scope. (Maybe this is by design.) Is there a way to access this Window level state using the Beam State persistence lib in another ParDo than where it was originally declared? Like as if I could declare it at the pipeline level?

I understand that this may introduce some performance overhead as the framework must manage concurrency, but the actual processing seems negligible. (Just incrementing values.) So I would prefer to write this to a window level state field rather than percolate values up via my hierarchy.

1
So after spending hours isolating and working at a StatSpec based solution I am abandoning this approach. I don't fully understand why, but I am only able to write to StatSpec ValueState in the first .apply in my pipeline. Later in my pipeline there is a race condition where the .write call does appear to work within the context of the ParDo, but then subsequent reads in the ParDo return null. So perhaps I am misunderstanding something about the API or there is a bug. I'm not sure. Perhaps this happens after I perform a GroupByKey and this affects the StatSpec scope somehow.user3205931
There was always going to be a bottleneck writing to a single StatSpec and so I will attempt to summarize values in another downstream aggregator.user3205931

1 Answers

0
votes

State sharing cross ParDos is not supported, and it shouldn't even be encouraged as it brings dependencies among ParDos that breaks the simple contract: ParDo can work on PCollection independently thus unblocks massive parallelism.