0
votes

I have the following requirement:

  • read events from a pub sub topic
  • take a window of duration 30 mins and period 1 minute
  • in that window if 3 events for a given id all match match some predicate then i need to raise an event in a different pub sub topic
  • The event should be raised as soon as the 3rd event comes in for the grouping id as this is for detecting fraudulent behaviour. In one pane there many be many ids that have 3 events that match my predicate so i may need to emit multiple events per pane

I am able to write a function which consumes a PCollection does the necessary grouping, logic and filtering and emit events according to my business logic.

Questions:

  1. The output PCollection contains duplicates due to the overlapping sliding windows. I understand this is the expected behaviour of sliding windows but how can I avoid this whilst staying in the same dataflow pipeline. I realise I could dedupe in an external system but that is just adding complexity to my system.
  2. I also need to write some sort of trigger that fires each and every time my condition is reached in a window
  3. Is dataflow suitable for this type of realtime detection scenario

Many thanks

1

1 Answers

1
votes
  1. You can rewindow the output PCollection into the global window (using the regular Window.into()) and dedupe using a GroupByKey.
  2. It sounds like you're already returning the events of interest as a PCollection. In order to "do something for each event", all you need is a ParDo.of(whatever action you want) applied to this collection. Triggers do something else: they control what happens when a new value V arrives for a particular key K in a GroupByKey<K, V>: whether to drop the value, or buffer it, or to pass the buffered KV<K, Iterable<V>> for downstream processing.
  3. Yes :)