1
votes

The weighted median of a sample is the 50% weighted percentile (see this post on Cross Validated for more info).

I was wondering how one would extend the algorithm for finding the median of a running stream of numbers detailed here (two heaps: a max heap for the lower half and a min heap for the upper half) to efficiently calculate the weighted median from a stream of double values and weights.
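For reference, here is a minimal sketch of that unweighted two-heap algorithm, written in Python for brevity even though the stream holds doubles; the class and method names are my own, not from the linked post:

```python
import heapq

class RunningMedian:
    """Plain (unweighted) running median kept with two heaps."""

    def __init__(self):
        self.lower = []  # max-heap via negated values: smaller half of the stream
        self.upper = []  # min-heap: larger half of the stream

    def add(self, x):
        # Push onto the lower half, move its largest element to the upper half,
        # then rebalance so the heap sizes differ by at most one.
        heapq.heappush(self.lower, -x)
        heapq.heappush(self.upper, -heapq.heappop(self.lower))
        if len(self.upper) > len(self.lower):
            heapq.heappush(self.lower, -heapq.heappop(self.upper))

    def median(self):
        # Assumes at least one value has been added.
        if len(self.lower) > len(self.upper):
            return -self.lower[0]
        return (-self.lower[0] + self.upper[0]) / 2.0
```

The difficulty the question asks about is that heap sizes count elements, not weights, so this rebalancing rule has no obvious weighted analogue.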

One idea I had was to use the same method as when calculating the median of an unweighted stream, but simply insert extra copies of a value when its weight is not one (e.g. a value with a weight of 2 would be inserted twice). However, that doesn't work for weights that can be doubles, and it is quite memory inefficient for large integer weights.

Thanks!

2 Answers

0
votes

One approach with O(n log n) total complexity would be to insert the values into an augmented, balanced binary search tree. The tree is ordered by value, and each node is augmented with a field holding the total weight of its subtree (its own weight plus the weights of all its descendants).

Inserting a new node costs O(log n), including updating the subtree-weight fields along the insertion path.

To look up the weighted median, descend the tree toward a target of half the total weight, using the subtree-weight fields to decide whether the target falls in the left subtree, at the current node, or in the right subtree. This search also takes O(log n).
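A minimal sketch of this idea in Python, using a plain (unbalanced) BST for brevity; a real implementation would need rebalancing (e.g. red-black or AVL rotations) to actually guarantee the O(log n) bounds, and all names here are illustrative rather than taken from this answer:

```python
class Node:
    def __init__(self, value, weight):
        self.value = value
        self.weight = weight          # weight attached to this exact value
        self.subtree_weight = weight  # this node's weight plus both subtrees'
        self.left = None
        self.right = None


class WeightedMedianTree:
    def __init__(self):
        self.root = None

    def insert(self, value, weight):
        # Walk down by value, adding the new weight to every subtree total on
        # the path. O(height) per insertion; a balanced tree keeps that O(log n).
        if self.root is None:
            self.root = Node(value, weight)
            return
        node = self.root
        while True:
            node.subtree_weight += weight
            if value < node.value:
                if node.left is None:
                    node.left = Node(value, weight)
                    return
                node = node.left
            elif value > node.value:
                if node.right is None:
                    node.right = Node(value, weight)
                    return
                node = node.right
            else:
                node.weight += weight  # equal values share one node
                return

    def weighted_median(self):
        # Descend toward half of the total weight: at each node decide whether
        # the 50% point lies in the left subtree, at this node, or to the right.
        # Assumes at least one value has been inserted.
        target = self.root.subtree_weight / 2.0
        node, below = self.root, 0.0  # 'below' = weight of values left of this subtree
        while True:
            left_w = node.left.subtree_weight if node.left else 0.0
            if node.left is not None and below + left_w >= target:
                node = node.left
            elif below + left_w + node.weight >= target:
                return node.value
            else:
                below += left_w + node.weight
                node = node.right


tree = WeightedMedianTree()
for value, weight in [(1.0, 0.5), (2.0, 3.0), (10.0, 1.0)]:
    tree.insert(value, weight)
print(tree.weighted_median())  # 2.0 -- it carries most of the total weight of 4.5
```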

0
votes

I ended up implementing a method that uses a sorted array (it serves essentially the same role as a heap, but is easier to search) and continually keeps track of where the fiftieth percentile of the total weight lies. I wrote a blog post about it that has more in-depth examples.
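A rough sketch of how that sorted-array approach might look (my own reconstruction under stated assumptions, not the code from the blog post): keep the (value, weight) pairs sorted by value, and maintain a cursor index together with the cumulative weight before it, nudging the cursor whenever a new element shifts the 50% point.

```python
import bisect

class StreamingWeightedMedian:
    def __init__(self):
        self.values = []   # sorted values
        self.weights = []  # weights[i] belongs to values[i]
        self.total = 0.0
        self.k = 0         # index of the current weighted-median candidate
        self.below = 0.0   # sum of weights strictly before index k

    def add(self, value, weight):
        i = bisect.bisect_left(self.values, value)
        if i < len(self.values) and self.values[i] == value:
            self.weights[i] += weight            # merge equal values
        else:
            self.values.insert(i, value)         # O(n) array insertion
            self.weights.insert(i, weight)
            if i <= self.k and len(self.values) > 1:
                self.k += 1                      # candidate was pushed one slot right
        if i < self.k:
            self.below += weight                 # new weight landed below the candidate
        self.total += weight
        self._rebalance()

    def _rebalance(self):
        # Nudge the candidate index until it sits at the 50% weighted percentile:
        # the smallest k with (weight before k) + weights[k] >= total / 2.
        target = self.total / 2.0
        while self.k > 0 and self.below >= target:
            self.k -= 1
            self.below -= self.weights[self.k]
        while self.below + self.weights[self.k] < target:
            self.below += self.weights[self.k]
            self.k += 1

    def median(self):
        # Assumes at least one value has been added.
        return self.values[self.k]


swm = StreamingWeightedMedian()
for v, w in [(2.0, 3.0), (1.0, 0.5), (10.0, 1.0)]:
    swm.add(v, w)
print(swm.median())  # 2.0: cumulative weight first reaches half of 4.5 there
```

In this sketch each update is O(n) in the worst case because of the array insertion, but the maintained cursor means the current weighted median is always available in O(1).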