Is there any way to sort the DataStream efficiently other than WindowAll operation?
Let's take the use case of page view that is defined in the Flink example ClickEventCount. This example tries to aggregate the page view count for every 15 minutes window.
What is the efficient way if I want to change this to get the top 3 page views for 15 windows?
One option is to use windowAll function after the aggregate function, and do the in-memory sort. The problem is that - windowAll will reduce the parallelism to 1 and it demands to keep all the data in the same task slot to perform sort operation. Ideally, this is fine when we have few distinct keys (ie - distinct page URL). In my use case, there will be millions or billions of keys in 15 minutes window, so, all those million or billions aggregated rows have to pass through the network and it may cause CPU intensive operation to take just 3.
Is there any way something like, take top 3 page views from its own task slot locally then use windowAll function to receive 3 elements from each task slot then just do the sort operation to pick the top 3 page views? In this method, the network time will be reduced also the sorting time will be less. Do we have such API or any possible method to achieve this use case efficiently?