Finding the medians of multiple subarrays in an unsorted array

Question

Suppose you are given an unsorted array of integers S and a list of ranges in T, return a list of medians from each of the ranges.

For example, S = [3,6,1,5,0,0,1,-2], T = [[1,3],[0,5],[4,4]]. Return [5, 2, 0].

Is there a better approach than running Median of Medians on each range? Can we somehow precompute/cache the results?

what does a range means? when you query with range [1, 3], do you mean finding median of S[1], S[2], S[3] or median of first, second, third smallest of S? — Petar Petrovic

Tobias Ribizel Tobias Ribizel · Accepted Answer · 2018-01-10T00:22:35

Let me introduce you to an interesting data structure called Wavelet Tree:

You build it by looking at the bit-string representation of your integers and recursively bisecting them:

You first separate your integers into those with most significant bit (MSB) 0 and those with MSB 1. However you store the MSBs in their original order in a bitvector. Then for each of these subsets of integers, you ignore the MSB and recursively repeat this construction for the next-most significant bit.

If you repeat this down to the least significant bit, you get a tree structure like this (note that the indices are just there for illustration, you should store only the bitvectors):

You can easily see that the construction of this data structure takes O(n log N) time where n is the number of integers and N is their maximum value.

Wavelet trees have the nice property that they represent the original sequence as well as their sorted counterpart at the same time:

If you read the topmost bitvector, you get the MSBs of the input sequence. To reconstruct the next bit of the entries, you can alternate between looking in the bitvector in the root's left child (if the MSB is 0) or in the right child (if the MSB is 1). For the following bits, you can continue recursively.
If you read the leaf nodes from left to right, you get the sorted sequence.

To use a Wavelet tree efficiently, you need two fundamental operations on the bitvectors:

rank1(k) tells you how many 1s come before the kth position in the bitvector, rank0 does the same for 0s
select1(k) tells you the index of the kth 1 in the bitvector, select0 does the same for 0s

Note that there are bitvector representations that require only o(n) (small o) bits of additional storage to implement these operations in O(1)

You can utilize them as follows:

If you are looking at the first 7 in the sequence above, it has index 3. If you now want to know which index it has in the right child node, you simply call rank1(3) on the root bitvector and get 2, which is exactly the index of the first 7 in the right child
If you are at the child containing 4544 and want to know the position of the second 4 (with index 2) in the parent node containing 46754476, you call select0(2) on the parent's bitvector and get the index 5.

Now how can you implement a range median query with this? The most important realization you need to make is that finding the median of a range of size k is equivalent to selecting the k/2 th element.

The basic idea of the algorithm is similar to Quickselect: Bisect the element range and recurse only into the range containing the element you are looking for.

Let's say we want to find the median of the range starting at the second 2 (inclusive) and ending at the 1 (exclusive). These are 7 elements, thus the median has rank 4 (fourth-smallest element) in that range. Now using a rank0/1 call in the root bitvector at the beginning and end of this range, we find the corresponding ranges in the children of the root:

As you can see, the left range (which contains only smaller elements) has only 3 elements, thus the element with rank 4 must be contained in the right child of the root. We can now recursively search for the element with rank 4 - 3 = 1 in that right child. By recursively descending the wavelet tree until you reach a leaf, you can thus identify the median with only two rank operations (à O(1) time) per level of the Wavelet tree, thus the whole range median query takes O(log N) time where N is the maximum number in your input sequence.

If you want to see a practical implementation of these Wavelet trees, have a look at the Succinct Data Structures Library (SDSL) which implements the aforementioned bitvectors and different WT variants.

Finding the medians of multiple subarrays in an unsorted array

1 Answers