INTEGER ELEMENTS:
If the type of your elements are integers, then the best way is to have a bucket for each number lies in any of your sub-ranges, where each bucket is used for counting the number its associated integer found in your input elements (for example, bucket[100]
stores how many 100
s are there in your input sequence). Basically you can achieve it in the following steps:
- create buckets for each number lies in any of your sub-ranges.
- iterate through all elements, for each number
n
, if we have bucket[n]
, then bucket[n]++
.
- compute the medians based on the aggregated values stored in your buckets.
Put it in another way, suppose you have a sub-range [0, 10]
, and you would like to compute the median. The bucket approach basically computes how many 0
s are there in your inputs, and how many 1
s are there in your inputs and so on. Suppose there are n
numbers lies in range [0, 10]
, then the median is the n/2
th largest element, which can be identified by finding the i
such that bucket[0] + bucket[1] ... + bucket[i]
greater than or equal to n/2
but bucket[0] + ... + bucket[i - 1]
is less than n/2
.
The nice thing about this is that even your input elements are stored in multiple machines (i.e., the distributed case), each machine can maintain its own buckets and only the aggregated values are required to pass through the intranet.
You can also use hierarchical-buckets, which involves multiple passes. In each pass, bucket[i]
counts the number of elements in your input lies in a specific range (for example, [i * 2^K, (i+1) * 2^K]
), and then narrow down the problem space by identifying which bucket will the medium lies after each step, then decrease K
by 1
in the next step, and repeat until you can correctly identify the medium.
FLOATING-POINT ELEMENTS
The entire elements can fit into memory:
If your entire elements can fit into memory, first sorting the N element and then finding the medians for each sub ranges is the best option. The linear time heap solution also works well in this case if the number of your sub-ranges is less than logN
.
The entire elements cannot fit into memory but stored in a single machine:
Generally, an external sort typically requires three disk-scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, then first sorting the N elements and then finding the medians for each sub ranges by only loading necessary elements from the disk is the best choice. Otherwise, simply performing a scan for each sub-ranges and pick up those elements in the sub-range is better.
The entire elements are stored in multiple machines:
Since finding median is a holistic operator, meaning you cannot derive the final median of the entire input based on the medians of several parts of input, it is a hard problem that one cannot describe its solution in few sentences, but there are researches (see this as an example) have been focused on this problem.
0,1,2,100
is not 50. – Bernhard Barker