2
votes

I am designing a basket abandoning system for an Ecommerce company. The system will send a message to a user based on the below rules:

  • There is no interaction by the user on the site for 30 minutes.
  • Has added more than $50 worth of products to the basket.
  • Has not yet completed a transaction.

I use Google Cloud Dataflow to process the data and decide if a message should be sent. I have couple of options in below:

  1. Use a Sliding window with a duration of 30 minutes.
  2. A global window with a time based trigger with a delay of 30 minutes.

I think Sliding Window might work here. But my question is, can there be a solution based on using a global window with a processing time based trigger and a delay for this usecase? As far as i understand the triggers based on Apache Beam documentation => Triggers allow Beam to emit early results, before a given window is closed. For example, emitting after a certain amount of time elapses, or after a certain number of elements arrives. Triggers allow processing late data by triggering after the event time watermark passes the end of the window.

So, for my use case and as per the above trigger concepts, i don't think the trigger can be triggered after a set delay for each and every user (It is mentioned in above - can emit only after a certain number of elements it is mentioned above, but not sure if that could be 1). Can you confirm?

3
Can you please try to give the solution (maybe in pseudo code) based on sliding window so I can understand your use case and help see if global window could work?Rui Wang
Sure, i will try to add, but in general is a trigger for each element possible in a global window based on delay?Roshan Fernando
Triggers (particularly processing time trigger and data driven trigger) can be set on global window with unbounded data (otherwise there will never have data emitted because global window never close). If you could provide more information on your use case, that would be helpful to confirm.Rui Wang
For whatever it's worth, the original post is a question copied and pasted from the Google Cloud Professional Data Engineer practice exam.Miles Erickson

3 Answers

4
votes

Both answers 1 - Sliding Windows and 2 - Global Window are incorrect

Sliding windows is not correct because - assuming there is one key per user, a message will be sent 30 minutes after they first started browsing even if they are still browsing

Global Windows is not correct because - it will cause messages to be sent out every 30 minutes to all users regardless of where they are in their current session

Even Fixed Windows would be incorrect in this case, because assuming there is one key per user, a message will be sent every 30 minutes

Correct answer would be - Use a session window with a gap duration of 30 minutes This is correct because it will send a message per user after that user is inactive for 30 minutes

1
votes

I think that sliding window is the correct approach from what you described, and I don't think you can solve this with trigger+delay. If event time sliding windowing makes sense from your business logic perspective, try to use it first, that's what it's for.

My understanding is that while you can use a trigger to produce early results, it is not guaranteed to fire at specific (server/processing) time or with exact number of elements (received so far for the window). The trigger condition enables/unblocks the runner to emit the window contents but it doesn't force it to do so.

In case of event time this makes sense, as it doesn't matter when the event arrives or when the trigger fires, because if the element has a timestamp within a window, then it will be assigned to the correct window no matter when it arrives. And when the trigger will fire for the window, the element will be guaranteed to be in that window if it has arrived.

With processing time you can't do this. If event arrives late, it will be accounted for at that time, and will be emitted next time the trigger fires, basically. And because the trigger doesn't guarantee the exact moment it fires you can potentially end up with unexpected data belonging to unexpected emitted panes. It is useful to get the early results in general but I am not sure if you can reason about windowing based on that.

Also, trigger delay only adds a fire delay (e.g. if it was supposed to fire at 12pm, not it will fire at 12.05pm) but it doesn't allow you to reliably stagger multiple trigger firings so that it goes off at specific intervals.

You can look at the design doc for triggers here: https://s.apache.org/beam-triggers , and possibly lateness doc may be relevant as well: https://s.apache.org/beam-lateness

Other docs can be found here, if you are interested: https://beam.apache.org/contribute/design-documents/ .

Update:

Rui pointed that this use case can be more complicated and probably not easily solvable by sliding windows. Maybe it's worth looking into session windows or manual logic on top of keys+state+timers

1
votes

I find state[1] and timer[2] doc of Apache Beam, which should be able to handle this specific use case without using processing time trigger in global window.

Assuming the incoming data are events of users' actions, and each event(action) can be keyed by user_id.

The nice property that state and timer have is on per key and window basis. So you can accumulate state for each user_id and the state is amount of money in cart in this case. Timer can be set at the first time when amount in cart exceeds $50, and timer can be reset when user still have shopping actions within 30 mins in processing time.

Assume transaction completion is also a user_id keyed event. When an transaction completion event is seen, timer can be deleted[3].


update:

This idea is completely on processing time domain so it will have false alarm messages depending on lateness problem in system. So the question is how to improve this idea to event time domain so we have less false alarm. One possibility is event time based timer[4]. I am not clear what does event time based timer mean at this moment.

[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html

[2] https://docs.google.com/document/d/1zf9TxIOsZf_fz86TGaiAQqdNI5OO7Sc6qFsxZlBAMiA/edit#

[3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/state/Timers.java#L45

[4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/state/TimeDomain.java#L33