Your question contains several aspects, but let me answer the most fundamental one:
Is this a hard task for Flink, why is this not a standard example?
Yes, the median is a hard concept, as the only way to determine it is to keep the full data.
Many statistics don't need the full data to be calculated. For instance:
- If you have the total sum, you can take the previous total sum and add the latest observation.
- If you have the total count, you add 1 and have the new total count
- If you have the average, under the hood you can just keep track of the total sum and count, and at any point calculate the new average based on an observation.
This can even be done with more complicated metrics, like the standard deviation.
However, there is no shortcut for determining the median, the only way to know what the median is after adding a new observation, is by looking at all observations and then figuring out what the middle one is.
As such, it is a challenging metric and the size of the data that comes in will need to be handled. As mentioned there may be estimates in the workings like this: https://issues.apache.org/jira/browse/FLINK-2147
Alternately, you could look at how your data is distributed, and perhaps estimate the median with metrics like Mean, Skew, and Kurtosis.
A final solution I could come up with, is if you need to know approximately what the value should be, is to pick a few 'candidates' and count the fractin of observations below them. The one closest to 50% would then be a reasonable estimate.