
First of all, I am new to stream processing frameworks. I would like to benchmark some of them so I've started with Flink.

For my use case, I need to compare events from a window t with events from the window t-1, both of size 15 minutes, and then do some aggregations.

Here is a simplified version of my use case:

We consider the analyzed events as tuples of the form (key, value). In window 1 we have (A,1), (B,2), (C,3), and in window 2 we have (D,6) and (B,7). I then need to compare the events of the current window with those of the previous window and keep the ones satisfying the condition Win2(K) - Win1(K) > 5. With the previous example we get (B,5). (If there are two events with the same key within a window, I need to sum them first.)

I don't really know how to keep both windows in memory. I was thinking of using a 15-minute tumbling window (window t) and a 30-minute sliding window that slides by 15 minutes, and then computing window t-1 by taking the difference between the two (see the sketch below).
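For concreteness, here is roughly what I mean by the two windows (a sketch only; the stream name, tuple types, and the assumption that event-time timestamps and watermarks are already assigned are mine, and the final "minus" step is only outlined in a comment):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {

    // "events" is assumed to be a stream of (key, value) tuples with event-time
    // timestamps and watermarks assigned upstream; this only shows the window definitions.
    public static void defineWindows(DataStream<Tuple2<String, Integer>> events) {

        // window t: 15-minute tumbling window, summing values per key
        DataStream<Tuple2<String, Integer>> current = events
                .keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.minutes(15)))
                .sum(1);

        // 30-minute window sliding every 15 minutes, i.e. covering windows t-1 and t together
        DataStream<Tuple2<String, Integer>> combined = events
                .keyBy(0)
                .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.minutes(15)))
                .sum(1);

        // The per-key value for window t-1 would then be combined minus current
        // (the "minus" step), e.g. via a join or connect/co-process on the key.
    }
}
```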

Is this a good solution or is there a better way to do it?


1 Answer


An alternative to the 30-minute sliding window you've proposed would be to use a ProcessFunction. This is a low-level operation that Apache Flink has provided since version 1.2; it combines state, per-element processing, and timers. In the case of a keyed stream, the state and timers are automatically scoped on a per-key basis. Here's an outline of how this might work:

State:
store the latest value and timestamp (implicitly this will be for each key)

As each element arrives:
1. if the state (for this key) holds the previous element and the difference is greater than 5, emit something appropriate
2. update the stored value and timestamp
3. set a timer to fire 16 minutes later

When a timer fires:
if the stored state is > 15 minutes old, clear it

If the key space is small you might decide not to bother with the timers -- they are there so that you aren't keeping a potentially unbounded amount of storage pertaining to stale keys.
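Here is a minimal sketch of that outline. It assumes the events are (key, value) pairs represented as Tuple2<String, Long>, that the stream has already been keyed by the String field, and that event-time timestamps are assigned upstream; class and variable names are my own, exact signatures may differ slightly between Flink versions, and summing duplicate keys is left out here too:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class CompareWithPrevious
        extends ProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final long FIFTEEN_MINUTES = 15 * 60 * 1000L;

    // latest value and its timestamp; because the stream is keyed,
    // this state is implicitly scoped per key
    private transient ValueState<Long> lastValue;
    private transient ValueState<Long> lastTimestamp;

    @Override
    public void open(Configuration parameters) {
        lastValue = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastValue", Long.class));
        lastTimestamp = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTimestamp", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        Long previous = lastValue.value();

        // 1. compare with the previous element for this key; emit if the difference > 5
        if (previous != null && event.f1 - previous > 5) {
            out.collect(Tuple2.of(event.f0, event.f1 - previous));
        }

        // 2. update the stored value and timestamp
        long now = ctx.timestamp();
        lastValue.update(event.f1);
        lastTimestamp.update(now);

        // 3. schedule a cleanup timer 16 minutes later
        ctx.timerService().registerEventTimeTimer(now + FIFTEEN_MINUTES + 60 * 1000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        // clear stored state that is more than 15 minutes old
        Long ts = lastTimestamp.value();
        if (ts != null && timestamp - ts > FIFTEEN_MINUTES) {
            lastValue.clear();
            lastTimestamp.clear();
        }
    }
}
```

Applied to the keyed stream with something like keyedStream.process(new CompareWithPrevious()), this emits a (key, difference) pair whenever the condition holds.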

For more info, see the docs on ProcessFunction and working with state.

In this proposal I've ignored what you said about multiple elements with the same key, but it shouldn't be difficult to adjust for that. (I've also assumed that by the time the data reaches this part of your pipeline, it's in order with respect to time, at least on a per-key basis.)

I'm not suggesting that ProcessFunction is simpler than your 30-minute sliding window proposal, but it may be more flexible and adaptable. Another approach, which would be simpler, would be to use Flink's Complex Event Processing (CEP) library. In Flink 1.3 I think it will be possible to express what you're doing with CEP, but please note that version 1.3 won't be released for a few more weeks. You can find the docs for 1.3 here.
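For example, with the 1.3-style CEP API and its iterative conditions, a pattern along these lines might capture the comparison. This is only a rough sketch under my own assumptions (tuple types, a stream keyed by a String key selector, and the select signature can vary between CEP versions); it again leaves out summing duplicate keys, and within(Time.minutes(30)) matches consecutive events rather than aligned 15-minute windows:

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.IterativeCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CepSketch {

    public static DataStream<Tuple2<String, Long>> detect(
            KeyedStream<Tuple2<String, Long>, String> keyedEvents) {

        // match a pair of consecutive events (per key) whose values differ by more than 5
        Pattern<Tuple2<String, Long>, Tuple2<String, Long>> pattern =
            Pattern.<Tuple2<String, Long>>begin("first")
                .next("second")
                .where(new IterativeCondition<Tuple2<String, Long>>() {
                    @Override
                    public boolean filter(Tuple2<String, Long> second,
                                          Context<Tuple2<String, Long>> ctx) throws Exception {
                        // compare against the previously matched event for this key
                        for (Tuple2<String, Long> first : ctx.getEventsForPattern("first")) {
                            if (second.f1 - first.f1 > 5) {
                                return true;
                            }
                        }
                        return false;
                    }
                })
                .within(Time.minutes(30));

        return CEP.pattern(keyedEvents, pattern)
            .select(new PatternSelectFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> select(
                        Map<String, List<Tuple2<String, Long>>> match) {
                    Tuple2<String, Long> first = match.get("first").get(0);
                    Tuple2<String, Long> second = match.get("second").get(0);
                    // emit the key together with the difference that triggered the match
                    return Tuple2.of(second.f0, second.f1 - first.f1);
                }
            });
    }
}
```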