3
votes

I've been studying binary heaps, and they are obviously a good data structure for a priority queue. Let's say that my stream of data has millions (N) of records, and I'm periodically interested in the top 1000 (k << N) records by rank. With enough space, I would just maintain an N-sized binary heap, and each insertion would be O(log N). What I'd like to do, though, is prune the tree on every insertion, i.e. discard the 1001st element. It's not obvious to me how to do pruning in less than O(k) time.

(If I were happy with O(k) time for each pruning (and insertion), I'd just maintain an ordered list of k elements, not a heap.)

One idea is to have two parallel heaps, one which keeps mins, the other keeping maxes, where both keep only the top 1000 elements. It's a bit ugly, though.

Just to clarify, these are my constraints:

  • insertion: ideally less than ~1000 ops (so rules out primitive list)
  • storage: limited, need to prune unpopular items roughly at rate of insertion (some constant-ish overhead is ok)
  • querying top 1000: top 1000 items don't have to be perfectly sorted, heap-ish order is fine
3
Can you state your question a little more clearly? (Hint: questions end in question marks)PengOne
If you prune after every k steps (k = 1000 is the number of items to select) with a selection algorithm, then each insert is O(1). (This is a comment because you specified every step.)Per
Basically, it comes down to this: - A balanced binary tree is great for a dynamic situation where you want to keep the top 1000 items over time, and you need to query them fairly frequently. - A Min-Max heap overcomes some of the limitations of a simple Min-Heap. - For a more static situation where you just want to get the top 1000 items from a large dataset, you essentially use an upside-down heap, where the head element is no longer the "best" item, but instead the head element is the gatekeeper to the top 1000, with ordering reversed such that "better" items are deeper down.Steve Howell
Is it not enough to have a 2047-item binary heap and whenever you add a new item and the heap is already full, just replace the last (2047th) element with the item you are inserting?Antti Huima

3 Answers

4
votes

You can do this very easily with a binary heap.

Say you have a stream of items of some unknown size, and you want to find the top 1,000 items. Here's the idea.

initialize heap
while (items to be read)
{
    read item
    if (heap.count < 1000 OR item > heap.Peek())
    {
        // Either we haven't added 1,000 items yet,
        // or the new item is larger than the smallest
        // item on the heap.
        heap.Add(item)
        if (heap.count > 1000)
        {
            // trim the heap
            // This makes sure that the heap doesn't
            // grow too large.
            heap.RemoveFirst()
        }
     }
}

( heap.Peek() examines but does not remove the lowest item on the heap).

When you're done, the heap will contain the top 1,000 items by rank.

This cannot be done in O(N) time. The complexity of this algorithm is O(N log k), where k is the size of the heap.

By the way, you won't maintain an ordered list in O(N) time, either.

Another option, if you can keep all 1,000,000 items in an array, is Quickselect. It runs in O(N) time, but I've found that when k is small in comparison to N, the heap selection technique is faster. See When theory meets practice for the details.

If you can't keep all the items in memory (i.e. you're working with a stream of data), then the heap selection technique is the best you can do. You could do the same thing with a skip list, which would also be O(n log k), but the skip list might perform slightly better than a binary heap.

By the way, that O(n log k) is the worst case, which would happen if the items were presented to the heap in sorted order. In that case, every item is added to the heap. If items are are distributed more normally, the majority of items don't get past the heap.Peek() test. My tests show that with a normal distribution, only about 10% of the items (when selecting 1,000 from 1,000,000) pass that first test. Again, more information is available in the blog post I linked above.

2
votes

Sounds like you need a Min-Max heap.

That gives you O(log(n)) operations for both delete min and delete max, which should help you achieve your goal.

1
votes

Heap is not good for searching items, and it doesn't preserve order of elements to keep first 1000 elements, you can do this with balanced binary search tree in O(n).

Edit: Also Idea of using min heap to get biggest item is fine enough, and I wasn't aware of that, but I prefer BST.