I've been studying binary heaps, and they are obviously a good data structure for a priority queue. Let's say that my stream of data has millions (N) of records, and I'm periodically interested in the top 1000 (k << N) records by rank. With enough space, I would just maintain an N-sized binary heap, and each insertion would be O(log N). What I'd like to do, though, is prune the tree on every insertion, i.e. discard the 1001st element. It's not obvious to me how to do pruning in less than O(k) time.
(If I were happy with O(k) time for each pruning (and insertion), I'd just maintain an ordered list of k elements, not a heap.)
One idea is to have two parallel heaps, one which keeps mins, the other keeping maxes, where both keep only the top 1000 elements. It's a bit ugly, though.
Just to clarify, these are my constraints:
- insertion: ideally less than ~1000 ops (so rules out primitive list)
- storage: limited, need to prune unpopular items roughly at rate of insertion (some constant-ish overhead is ok)
- querying top 1000: top 1000 items don't have to be perfectly sorted, heap-ish order is fine