5
votes

I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of concatenating the vectors and then putting the resulting huge vector through numpy.percentile, is there a more efficient way?

My idea would be to first count the frequencies of the different values (e.g. using scipy.stats.itemfreq), then combine those frequency tables across the different vectors, and finally calculate the percentiles from the combined counts.

Unfortunately I haven't been able to find functions either for combining the frequency tables (this is not entirely trivial, as different tables may cover different sets of items), or for calculating percentiles from an item frequency table. Do I need to implement these, or can I use existing Python functions? What would those functions be?

2
You are right! The Counter class can do the first part of what I'd like, as you can add those up. I just need a function to calculate percentiles from a Counter, and that would make the answer complete. - user2443147
@Geza It would be easier if you posted an example input and the wanted output, including the code you've tried yourself. - dwitvliet
@Banana Yes, I know that is what you generally do on StackOverflow. But I cannot really post those huge arrays (they are actually parts of long waveform files; but any list or numpy array would do to test code). And I mentioned the functions I have considered; note that I'm not even looking for code, just function names. I think all I can do is link a page explaining what a percentile means. I'll do that. - user2443147
What is the problem with concatenating the vectors? Percentiles can be quite expensive to compute, so the concatenation cost might be amortized. For efficient percentile computation in NumPy you need version 1.9. - jtaylor
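For reference, here is the straightforward baseline the last comment refers to, on tiny made-up arrays (any list or numpy array works, per the question's comments):

```python
import numpy as np

# Tiny stand-ins for the ensemble of large vectors (made up for illustration).
vectors = [np.array([1, 2, 2, 3]), np.array([3, 3, 4]), np.array([5, 6])]

# The baseline: concatenate everything and let numpy.percentile do the work.
combined = np.concatenate(vectors)
print(np.percentile(combined, [25, 50, 75]))  # [2. 3. 4.]
```

Whether this is fast enough depends on how large the concatenated array gets; the answers below avoid materializing it.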

2 Answers

4
votes

Using collections.Counter to solve the first problem (calculating and combining the frequency tables), following Julien Palard's suggestion, and my own implementation for the second problem (calculating percentiles from a frequency table):

from collections import Counter

def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
    """Returns [(percentile, value)] with nearest rank percentiles.
    Percentile 0: <min_value>, 100: <max_value>.
    cnts_dict: { <value>: <count> }
    percentiles_to_calc: iterable for percentiles to calculate; 0 <= ~ <= 100
    """
    assert all(0 <= p <= 100 for p in percentiles_to_calc)
    percentiles = []
    num = sum(cnts_dict.values())
    cnts = sorted(cnts_dict.items())
    curr_cnts_pos = 0  # current index into cnts
    curr_pos = cnts[0][1]  # cumulative count up to and including curr_cnts_pos
    for p in sorted(percentiles_to_calc):
        if p < 100:
            percentile_pos = p / 100.0 * num
            while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts):
                curr_cnts_pos += 1
                curr_pos += cnts[curr_cnts_pos][1]
            percentiles.append((p, cnts[curr_cnts_pos][0]))
        else:
            percentiles.append((p, cnts[-1][0]))  # p == 100: simply the maximum
    return percentiles

cnts_dict = Counter()
for segment in segment_iterator:  # segment_iterator yields the large vectors
    cnts_dict += Counter(segment)

percentiles = calc_percentiles(cnts_dict)
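To make the combining step concrete, here is a toy check (my addition; the values are made up): Counter objects add element-wise, so frequency tables over different sets of values merge correctly:

```python
from collections import Counter

# Two toy "segments" with partially overlapping values.
a = Counter([1, 2, 2, 3])
b = Counter([3, 3, 4])

# Adding Counters sums the counts of shared keys; keys missing
# from one table are treated as having count 0.
combined = a + b
print(dict(sorted(combined.items())))  # {1: 1, 2: 2, 3: 3, 4: 1}
```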

2
votes

The same question has been bothering me for a long time, and I decided to make an effort. The idea is to reuse something from scipy.stats, so that we get cdf and ppf out of the box.

There is a class rv_discrete meant for subclassing. Browsing the sources for something similar among its subclasses, I found rv_sample with an interesting description: "A 'sample' discrete distribution defined by the support and values." The class is not exposed in the public API, but it is used when you pass values directly to rv_discrete.

Thus, here is a possible solution:

import numpy as np
import scipy.stats

# some mapping from numeric values to the frequencies
freqs = np.array([
    [1, 3],
    [2, 10],
    [3, 13],
    [4, 12],
    [5, 9],
    [6, 4],
])

def distrib_from_freqs(arr: np.ndarray) -> scipy.stats.rv_discrete:
    pmf = arr[:, 1] / arr[:, 1].sum()
    distrib = scipy.stats.rv_discrete(values=(arr[:, 0], pmf))
    return distrib

distrib = distrib_from_freqs(freqs)

print(distrib.pmf(freqs[:, 0]))
print(distrib.cdf(freqs[:, 0]))
print(distrib.ppf(distrib.cdf(freqs[:, 0])))  # percentiles

# [0.05882353 0.19607843 0.25490196 0.23529412 0.17647059 0.07843137]
# [0.05882353 0.25490196 0.50980392 0.74509804 0.92156863 1.        ]
# [1. 2. 3. 4. 5. 6.]

# max, median, 1st quartile, 3rd quartile
print(distrib.ppf([1.0, 0.5, 0.25, 0.75]))
# [6. 3. 2. 5.]

# ppf is defined on quantiles from (0, 1];
#   ppf(0) returns the value just below the minimum of the support:
print(distrib.ppf(0))
# 0.0
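A side benefit of this approach (my addition, not part of the original answer): because rv_discrete is a full distribution object, summary statistics come for free alongside cdf and ppf:

```python
import numpy as np
import scipy.stats

# Same frequency table as above.
freqs = np.array([[1, 3], [2, 10], [3, 13], [4, 12], [5, 9], [6, 4]])
pmf = freqs[:, 1] / freqs[:, 1].sum()
distrib = scipy.stats.rv_discrete(values=(freqs[:, 0], pmf))

print(distrib.mean())    # frequency-weighted mean: 179/51
print(distrib.median())  # ppf(0.5) -> 3.0
print(distrib.std())     # standard deviation of the distribution
```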