5
votes

I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of concatenating the vectors and then putting the resulting huge vector through numpy.percentile, is there a more efficient way?

My idea would be to first count the frequencies of the different values (e.g. using scipy.stats.itemfreq), then combine those frequency tables across the different vectors, and finally calculate the percentiles from the combined counts.

Unfortunately I haven't been able to find functions either for combining the frequency tables (this is not entirely trivial, as different tables may cover different sets of items), or for calculating percentiles from an item frequency table. Do I need to implement these, or can I use existing Python functions? What would those functions be?

2
You are right! The Counter class can do the first part of what I'd like, as you can add those up. I just need a function to calculate percentiles from a Counter, and that would make the answer complete. - user2443147
@Geza It would be easier if you posted an example input and the wanted output, including the code you've tried yourself. - dwitvliet
@Banana Yes, I know that is what you generally do on StackOverflow. But I cannot really post those huge arrays (they are actually parts of long waveform files; but any list or numpy array would do to test code). And I mentioned the functions I have considered; note that I'm not even looking for code, just function names. I think all I can do is link a page explaining what a percentile means. I'll do that. - user2443147
What is the problem with concatenating the vectors? Percentiles can be quite expensive to compute, so the concatenation cost might be amortized. For efficient percentile computation in NumPy you need version 1.9. - jtaylor
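For reference, here is the straightforward baseline the last comment refers to, on tiny made-up arrays (any list or numpy array works, per the question's comments):

```python
import numpy as np

# Tiny stand-ins for the ensemble of large vectors (made up for illustration).
vectors = [np.array([1, 2, 2, 3]), np.array([3, 3, 4]), np.array([5, 6])]

# The baseline: concatenate everything and let numpy.percentile do the work.
combined = np.concatenate(vectors)
print(np.percentile(combined, [25, 50, 75]))  # [2. 3. 4.]
```

Whether this is fast enough depends on how large the concatenated array gets; the answers below avoid materializing it.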

2 Answers

4
votes

Using collections.Counter to solve the first problem (calculating and combining the frequency tables), following Julien Palard's suggestion, and my own implementation for the second problem (calculating percentiles from a frequency table):

from collections import Counter

def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
    """Returns [(percentile, value)] with nearest rank percentiles.
    Percentile 0: <min_value>, 100: <max_value>.
    cnts_dict: { <value>: <count> }
    percentiles_to_calc: iterable for percentiles to calculate; 0 <= ~ <= 100
    """
    assert all(0 <= p <= 100 for p in percentiles_to_calc)
    percentiles = []
    num = sum(cnts_dict.values())
    cnts = sorted(cnts_dict.items())
    curr_cnts_pos = 0  # current index into cnts
    curr_pos = cnts[0][1]  # cumulative count up to and including curr_cnts_pos
    for p in sorted(percentiles_to_calc):
        if p < 100:
            percentile_pos = p / 100.0 * num
            while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts):
                curr_cnts_pos += 1
                curr_pos += cnts[curr_cnts_pos][1]
            percentiles.append((p, cnts[curr_cnts_pos][0]))
        else:
            percentiles.append((p, cnts[-1][0]))  # p == 100: simply the maximum
    return percentiles

cnts_dict = Counter()
for segment in segment_iterator:  # segment_iterator yields the large vectors
    cnts_dict += Counter(segment)

percentiles = calc_percentiles(cnts_dict)
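To make the combining step concrete, here is a toy check (my addition; the values are made up): Counter objects add element-wise, so frequency tables over different sets of values merge correctly:

```python
from collections import Counter

# Two toy "segments" with partially overlapping values.
a = Counter([1, 2, 2, 3])
b = Counter([3, 3, 4])

# Adding Counters sums the counts of shared keys; keys missing
# from one table are treated as having count 0.
combined = a + b
print(dict(sorted(combined.items())))  # {1: 1, 2: 2, 3: 3, 4: 1}
```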

2
votes

The same question has been bothering me for a long time, and I decided to make an effort. The idea is to reuse something from scipy.stats, so that we get cdf and ppf out of the box.

There is a class rv_discrete meant for subclassing. Browsing the sources for something similar among its subclasses, I found rv_sample with an interesting description: "A 'sample' discrete distribution defined by the support and values." The class is not exposed in the public API, but it is used when you pass values directly to rv_discrete.

Thus, here is a possible solution:

import numpy as np
import scipy.stats

# some mapping from numeric values to the frequencies
freqs = np.array([
    [1, 3],
    [2, 10],
    [3, 13],
    [4, 12],
    [5, 9],
    [6, 4],
])

def distrib_from_freqs(arr: np.ndarray) -> scipy.stats.rv_discrete:
    pmf = arr[:, 1] / arr[:, 1].sum()
    distrib = scipy.stats.rv_discrete(values=(arr[:, 0], pmf))
    return distrib

distrib = distrib_from_freqs(freqs)

print(distrib.pmf(freqs[:, 0]))
print(distrib.cdf(freqs[:, 0]))
print(distrib.ppf(distrib.cdf(freqs[:, 0])))  # percentiles

# [0.05882353 0.19607843 0.25490196 0.23529412 0.17647059 0.07843137]
# [0.05882353 0.25490196 0.50980392 0.74509804 0.92156863 1.        ]
# [1. 2. 3. 4. 5. 6.]

# max, median, 1st quartile, 3rd quartile
print(distrib.ppf([1.0, 0.5, 0.25, 0.75]))
# [6. 3. 2. 5.]

# ppf is defined on quantiles from (0, 1];
#   ppf(0) returns the value just below the minimum of the support:
print(distrib.ppf(0))
# 0.0
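A side benefit of this approach (my addition, not part of the original answer): because rv_discrete is a full distribution object, summary statistics come for free alongside cdf and ppf:

```python
import numpy as np
import scipy.stats

# Same frequency table as above.
freqs = np.array([[1, 3], [2, 10], [3, 13], [4, 12], [5, 9], [6, 4]])
pmf = freqs[:, 1] / freqs[:, 1].sum()
distrib = scipy.stats.rv_discrete(values=(freqs[:, 0], pmf))

print(distrib.mean())    # frequency-weighted mean: 179/51
print(distrib.median())  # ppf(0.5) -> 3.0
print(distrib.std())     # standard deviation of the distribution
```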