12
votes

Given a sequence of data (which may contain duplicates) and a fixed-size moving window, slide the window along the sequence one step at a time, such that (1) the oldest element is removed from the window and a new element is pushed in, and (2) the median of the data inside the window is found at each move.

The following posts are not helpful.

Effectively to find the median value of a random sequence

joining data based on a moving time window in R

My idea:

Use two heaps to hold the median. Inside the window, sort the data in the first iteration; the min heap holds the larger half and the max heap holds the smaller half. If the window has an odd number of elements, the top of the max heap is the median, otherwise the arithmetic mean of the top elements of the two heaps is the median.

When a new element is pushed into the window, remove the oldest element from whichever heap holds it, and compare the new element with the tops of the max heap and min heap to decide which heap it goes into. Then find the median just as in the first iteration.
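A minimal sketch of this two-heap idea (insertion and rebalancing only, using std::priority_queue; the names are just for illustration, and deletion is exactly the problem raised below):

#include <queue>
#include <vector>
#include <functional>

// Sketch: max-heap "lo" holds the smaller half, min-heap "hi" the larger half.
struct TwoHeapMedian {
    std::priority_queue<int> lo;                                      // max heap
    std::priority_queue<int, std::vector<int>, std::greater<int>> hi; // min heap

    void push(int x) {
        if (lo.empty() || x <= lo.top()) lo.push(x); else hi.push(x);
        // rebalance: lo.size() == hi.size() or lo.size() == hi.size() + 1
        if (lo.size() > hi.size() + 1) { hi.push(lo.top()); lo.pop(); }
        else if (hi.size() > lo.size()) { lo.push(hi.top()); hi.pop(); }
    }
    double median() const {
        if (lo.size() > hi.size()) return lo.top();   // odd count: max-heap top
        return (lo.top() + hi.top()) / 2.0;           // even count: mean of tops
    }
};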

But how to find a given element in a heap is a problem: a heap is a binary tree, not a binary search tree.

Is it possible to solve it in O(n) or O(n * lg m) time, where m is the window size, with O(1) space?

Any help is really appreciated.

Thanks

5
Is the newest data the next item, or is there some other criterion? Are you processing these items in a first-in, first-out way? - Glenn
@Glenn, at each iteration the oldest element is deleted from the window, a new element is put into the window, and then the new median of the window is found. For the oldest data, it is FIFO. Thanks - user1002288
I don't think space O(1) is possible. You need to store the window contents, so you won't get below O(m). - hc_
How do you remove the oldest data from one of the heaps? - sam_k

5 Answers

12
votes

O(n*lg m) is easy:

Just maintain your window as two std::multisets (a plain std::set would drop the duplicates your sequence may contain), one for the lower half and one for the upper half. Insertion of a new element costs O(lg m); finding and removal of an old element costs the same. Determining the median using the method you described in your question costs O(1).

As you slide the window over your sequence, in each iteration you remove the item falling out of the window (O(lg m)), insert the new item (O(lg m)) and compute the median (O(1)), resulting in a total of O(n lg m).

This solution uses O(m) space, of course, but I don't think you can get away without storing the window's contents.
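For concreteness, a sketch of this approach (class and variable names are mine; the multisets tolerate duplicate values):

#include <set>
#include <iterator>
#include <vector>
#include <cstddef>
#include <cstdio>

// "lo" holds the smaller half (its last element is the lower median),
// "hi" the larger half; invariant: lo.size() == hi.size() or hi.size() + 1.
class SlidingMedian {
    std::multiset<int> lo, hi;

    void rebalance() {
        while (lo.size() > hi.size() + 1) {
            hi.insert(*lo.rbegin());
            lo.erase(std::prev(lo.end()));
        }
        while (hi.size() > lo.size()) {
            lo.insert(*hi.begin());
            hi.erase(hi.begin());
        }
    }

public:
    void insert(int x) {                     // O(lg m)
        if (lo.empty() || x <= *lo.rbegin()) lo.insert(x); else hi.insert(x);
        rebalance();
    }
    void erase(int x) {                      // O(lg m); x must be in the window
        if (!lo.empty() && x <= *lo.rbegin()) lo.erase(lo.find(x));
        else hi.erase(hi.find(x));
        rebalance();
    }
    double median() const {                  // O(1)
        if (lo.size() > hi.size()) return *lo.rbegin();
        return (*lo.rbegin() + *hi.begin()) / 2.0;
    }
};

int main() {
    std::vector<int> data = {1, 3, 5, 7, 7, 2, 4};
    const std::size_t m = 3;                 // window size
    SlidingMedian w;
    for (std::size_t i = 0; i < data.size(); ++i) {
        w.insert(data[i]);
        if (i >= m) w.erase(data[i - m]);    // drop the element leaving the window
        if (i + 1 >= m) std::printf("%.1f\n", w.median());
    }
}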

7
votes

I have implemented almost exactly the algorithm you describe here: http://ideone.com/8VVEa, and described it here: Rolling median in C - Turlach implementation

The way to get around the "find oldest" problem is to keep the values in a circular buffer, so you always have a pointer to the oldest. What you store in the heap are buffer indexes. So the space requirement is 2M, and each update is O(lg M).
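A condensed sketch of that bookkeeping (not the linked code; one max-heap only, to show the index/position machinery, with names of my own choosing):

#include <vector>
#include <utility>

// A max-heap that stores circular-buffer indexes instead of values.
// "pos" maps each buffer slot to its position in the heap, so the slot
// holding the oldest value can be removed in O(lg M) without searching.
class IndexedMaxHeap {
    std::vector<double> buf;   // circular buffer: slot (t % M) is overwritten at time t
    std::vector<int> heap;     // heap of buffer indexes, ordered by buf[index]
    std::vector<int> pos;      // pos[i] = where buffer index i sits in heap, or -1

    void swapAt(int a, int b) {
        std::swap(heap[a], heap[b]);
        pos[heap[a]] = a; pos[heap[b]] = b;
    }
    void siftUp(int i) {
        while (i > 0 && buf[heap[i]] > buf[heap[(i - 1) / 2]]) {
            swapAt(i, (i - 1) / 2); i = (i - 1) / 2;
        }
    }
    void siftDown(int i) {
        for (;;) {
            int big = i, l = 2 * i + 1, r = 2 * i + 2, n = (int)heap.size();
            if (l < n && buf[heap[l]] > buf[heap[big]]) big = l;
            if (r < n && buf[heap[r]] > buf[heap[big]]) big = r;
            if (big == i) return;
            swapAt(i, big); i = big;
        }
    }

public:
    explicit IndexedMaxHeap(int m) : buf(m), pos(m, -1) {}

    void removeSlot(int i) {   // O(lg M): jump straight to the entry via pos
        int p = pos[i];
        swapAt(p, (int)heap.size() - 1);
        heap.pop_back(); pos[i] = -1;
        if (p < (int)heap.size()) { siftDown(p); siftUp(p); }
    }
    // Overwrite buffer slot i (the oldest slot) with a new value.
    void replace(int i, double value) {
        if (pos[i] >= 0) removeSlot(i);
        buf[i] = value;
        heap.push_back(i);
        pos[i] = (int)heap.size() - 1;
        siftUp(pos[i]);
    }
    double top() const { return buf[heap[0]]; }
};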

1
votes

Same answer as hc_, but instead of using a stock BST use a version where every node stores the count of elements in its subtree. This way finding the median is O(log(m)), as sketched below.
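For example, the k-th-smallest lookup that this size augmentation enables (a sketch; insert/delete must also keep the size fields up to date, and the node layout here is hypothetical):

// Each node caches the size of its subtree.
struct Node {
    int key;
    int size;                 // number of nodes in this subtree
    Node *left, *right;
};

int subtreeSize(const Node* t) { return t ? t->size : 0; }

// k-th smallest key, 0-based; requires k < subtreeSize(t).
int select(const Node* t, int k) {
    int ls = subtreeSize(t->left);
    if (k < ls)  return select(t->left, k);
    if (k == ls) return t->key;
    return select(t->right, k - ls - 1);
}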

1
votes

I gave this answer for the "rolling median in C" question.

I couldn't find a modern implementation of a C++ data structure with order statistics, so I ended up implementing both ideas from the TopCoder link (Match Editorial: scroll down to FloatingMedian).

Two multisets

The first idea partitions the data into two data structures (heaps, multisets, etc.) with O(lg N) per insert/delete, but does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75th percentile, but not both at the same time.

Segment tree

The second idea uses a segment tree, which is O(lg N) for insertions/deletions/queries but is more flexible. Best of all, the "N" is the size of your data range. So if your rolling median has a window of a million items but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!! (And you only need about 2 * 65536 * sizeof(counting_type) bytes for the counters, e.g. 2 * 65536 * 4.)

GNU Order Statistic Trees

Just before giving up, I found that libstdc++ contains order statistic trees!!!

These have two critical operations:

iter  = tree.find_by_order(k)     // iterator to the k-th smallest element (0-based)
order = tree.order_of_key(value)  // number of keys strictly less than value

See libstdc++ manual policy_based_data_structures_test (search for "split and join").

I have wrapped the tree in a convenience header for compilers supporting C++0x/C++11-style template aliases:

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, it
// provides the methods find_by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H
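A usage sketch (assuming the header above is saved as gnu_order_statistic_set.h, a hypothetical filename; note that with std::less<T> the tree behaves as a set, so duplicate window values would need a workaround such as keying on (value, sequence number) pairs):

#include <iostream>
#include <initializer_list>
#include "gnu_order_statistic_set.h"   // hypothetical name for the header above

int main() {
    t_order_statistic_set<int> s;
    for (int v : {5, 1, 9, 3, 7}) s.insert(v);

    std::cout << *s.find_by_order(s.size() / 2) << '\n';   // 3rd smallest: 5
    std::cout << s.order_of_key(7) << '\n';                // keys less than 7: 3
}
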
0
votes

I attach my segment tree (see my other post), which allows the frequency distribution of values in the window to be queried very efficiently.

This implements the following data structure:

|-------------------------------|
|---------------|---------------|
|-------|-------|-------|-------|
|---|---|---|---|---|---|---|---|
  0   1   2   3   4   5   6   7  

Each segment keeps the count of items in the range it covers. I use 2N segments for a range of values from 1..N. These are placed in a single flattened vector rather than the tree format shown figuratively above.

So if you are calculating rolling medians over a set of integers which vary from 1..65536, then you only need about 512 KB to store the counters (2N counts of 4 bytes each), and can insert/delete/query in O(lg N), where N is the size of the range (here 2**16), i.e. 16 operations per update.

This is a big win if the data range is much smaller than your rolling window.

#if !defined(SEGMENT_TREE_H)
#define SEGMENT_TREE_H
#include <cassert>
#include <array>
#include <algorithm>

#ifndef NDEBUG
#include <set>   // debug-only shadow multiset used by the assertions
#endif

template<typename COUNTS, unsigned BITS>
class t_segment_tree
{
    static const unsigned                       cnt_elements    = (1 << BITS);       // size of the value range
    static const unsigned                       cnt_storage     = cnt_elements << 1;
    std::array<COUNTS, cnt_elements * 2 - 1>    counts;     // implicit binary tree of range counts, root at counts[0]
    unsigned                                    count;      // total number of elements currently stored

#ifndef NDEBUG
    std::multiset<unsigned>                     elements;
#endif
    public:

    //____________________________________________________________________________________

    //  constructor

    //____________________________________________________________________________________
    t_segment_tree(): count(0)
    {
        std::fill_n(counts.begin(), counts.size(),  0);
    }
    //~t_segment_tree();

    //____________________________________________________________________________________

    //  size

    //____________________________________________________________________________________
    unsigned size() const  { return count; }

    //____________________________________________________________________________________

    //  insert

    //____________________________________________________________________________________
    void insert(unsigned x)
    {
#ifndef NDEBUG
        elements.insert(x);
        assert("...............This element is too large for the number of BITs!!..............." && cnt_elements > x);
#endif
        // Walk from the leaf for x up to the root, bumping each covering segment's count.
        unsigned ii = x + cnt_elements;
        while (ii)
        {
            ++counts[ii - 1];
            ii >>= 1;
        }
        ++count;
    }

    //____________________________________________________________________________________

    //  erase 

    //      assumes erase is in the set
    //____________________________________________________________________________________
    void erase(unsigned x)
    {
#ifndef NDEBUG
        // if the assertion failed here, it means that x was never "insert"-ed in the first place
        assert("...............This element was not 'insert'-ed before it is being 'erase'-ed!!..............." && elements.count(x));
        elements.erase(elements.find(x));
#endif
        // Walk from the leaf for x up to the root, decrementing each covering segment's count.
        unsigned ii = x + cnt_elements;
        while (ii)
        {
            --counts[ii - 1];
            ii >>= 1;
        }
        --count;
    }

    // 
    //____________________________________________________________________________________

    //  kth element

    //____________________________________________________________________________________
    unsigned operator[](unsigned k)
    {
        assert("...............The kth element: k needs to be smaller than the number of elements!!..............." && k < size());
        // Descend the implicit tree: whenever the subtree at ii holds <= k
        // items, skip past it (subtract its count and step to the right
        // sibling), then drop down a level.
        unsigned ii = 1;
        while (ii < cnt_storage)
        {
            if (counts[ii - 1] <= k)
               k -= counts[ii++ - 1];
            ii <<= 1;
        }
        return (ii >> 1) - cnt_elements;   // convert the final leaf position back to a value
    }

};
#endif
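A usage sketch of the class above (assuming it is saved as segment_tree.h; the filename and sample data are mine):

#include <cstdio>
#include "segment_tree.h"   // hypothetical name for the header above

int main() {
    t_segment_tree<unsigned, 8> tree;            // 8 bits: values must lie in 0..255
    const unsigned data[] = {12, 5, 200, 7, 42, 42, 99, 3};
    const unsigned m = 5;                        // window size
    for (unsigned i = 0; i < 8; ++i) {
        tree.insert(data[i]);
        if (i >= m) tree.erase(data[i - m]);     // drop the element leaving the window
        if (i + 1 >= m)                          // window is full: report the median
            std::printf("median at %u: %u\n", i, tree[tree.size() / 2]);
    }
}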