I'm developing a Beam pipeline for the Dataflow runner. I need the functionality below for my use case.

  1. Read input events from Kafka topic(s). Each Kafka message value yields a [userID, Event] pair.
  2. For each userID, I need to maintain a profile, and the current Event may trigger an update to it. If the profile is updated:
    • The updated profile is written to an output stream.
    • The next Event for that userID in the pipeline should refer to the updated profile.

I was thinking of using Beam's built-in state functionality, without depending on an external key-value store, to maintain the user profile. Is this feasible with the current version of Beam (2.1.0) and the Dataflow runner? If I understand correctly, state is scoped to the elements in a single window firing (i.e. even for a GlobalWindow, the state would be scoped to the elements of a single firing of the window caused by a trigger). Am I missing something here? A rough sketch of what I have in mind is below.
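
For reference, here is roughly the pipeline shape I'm planning; ParseUserEvent and UpdateProfileFn are placeholders for my parsing and profile-update logic, and the broker address and topic name are made up:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

Pipeline pipeline = Pipeline.create(options);  // options: my DataflowPipelineOptions

// Read from Kafka, key each event by userID, then apply a stateful DoFn.
PCollection<Profile> updatedProfiles = pipeline
    .apply(KafkaIO.<String, byte[]>read()
        .withBootstrapServers("broker:9092")       // placeholder broker address
        .withTopic("user-events")                  // placeholder topic name
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(ByteArrayDeserializer.class)
        .withoutMetadata())
    .apply(ParDo.of(new ParseUserEvent()))         // -> KV<String userID, Event>
    .apply(ParDo.of(new UpdateProfileFn()));       // per-user profile state lives here
// updatedProfiles would then be written to the output stream (e.g. another topic).
```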

1 Answer

State would be perfectly appropriate for your use case.

The only correction is that state is scoped to a single window, but trigger firings do not affect it. So if your state is small, you can keep it in the global window. When a new element arrives, you can read the state, output elements as needed, and write changes back to the state.
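
As a sketch (not tested; Profile, Event, and applyEvent stand in for your own types and update logic), a stateful DoFn over KV<userID, Event> elements could look like this:

```java
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class UpdateProfileFn extends DoFn<KV<String, Event>, Profile> {

  // One profile per key (userID) per window; in the global window this is
  // simply the user's current profile. The coder is inferred here; you can
  // also pass a Coder<Profile> to StateSpecs.value() explicitly.
  @StateId("profile")
  private final StateSpec<ValueState<Profile>> profileSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(ProcessContext c,
      @StateId("profile") ValueState<Profile> profileState) {
    Profile current = profileState.read();   // null on the first event for this user
    Profile updated = applyEvent(current, c.element().getValue());  // your update logic
    if (!updated.equals(current)) {
      profileState.write(updated);  // the next event for this user reads the update
      c.output(updated);            // emit the updated profile to the output stream
    }
  }
}
```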

The main thing to consider is how large the state may become if you have an unbounded number of user IDs. For instance, you may want an inactivity timer that clears a user's state after some period without events.
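
Again as a sketch, that could be a processing-time timer added inside the UpdateProfileFn above; the one-hour timeout is an arbitrary choice:

```java
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.joda.time.Duration;

// Added inside UpdateProfileFn:
@TimerId("expiry")
private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

@ProcessElement
public void processElement(ProcessContext c,
    @StateId("profile") ValueState<Profile> profileState,
    @TimerId("expiry") Timer expiryTimer) {
  // Push the expiry out every time this user is active.
  expiryTimer.offset(Duration.standardHours(1)).setRelative();
  // ... profile read/update/output logic as above ...
}

@OnTimer("expiry")
public void onExpiry(OnTimerContext c,
    @StateId("profile") ValueState<Profile> profileState) {
  profileState.clear();  // no events for an hour: drop this user's profile
}
```

An event-time timer would also work if you want expiry tied to watermark progress rather than wall-clock time.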

If you haven't read them, the blog posts Stateful Processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam provide a good introduction to these concepts and APIs.