3 votes

I found this website, http://www.cluster-text.com/confidence_interval.php, which calculates a confidence interval for a hypergeometric distribution and links to Wikipedia's Clopper-Pearson interval section here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval

I'm now wondering how to do the same calculation in Python. I assume that the two large expressions below the sentence "The beta distribution is, in turn, related to the F-distribution so a third formulation of the Clopper-Pearson interval can be written using F quantiles:" are the lower and upper boundaries. For some reason, the formula doesn't seem to care about the population size at all...
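Written out in the same notation as my code below (and assuming that F(q; d1, d2) means the q-quantile of an F distribution with d1 and d2 degrees of freedom, which is my own reading of the page), the two expressions are

lower = (1 + (n - x + 1) / (x * F(alpha/2; 2*x, 2*(n - x + 1))))^(-1)
upper = (1 + (n - x) / ((x + 1) * F(1 - alpha/2; 2*(x + 1), 2*(n - x))))^(-1)

where n is the sample size, x is the number of successes, and alpha is one minus the confidence level.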

The formula presented on Wikipedia calls an F function, which I assumed must be one of the functions in scipy.stats.f. I wrote a program to try all of them and found that using stats.f.ppf as the "mysterious" F function gives the closest results.
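As a sanity check (my own addition, not something from the site or Wikipedia): stats.f.ppf is the quantile function, i.e. the inverse CDF, of the F distribution, which is exactly what "F quantiles" should mean. A minimal check, with arbitrarily chosen degrees of freedom:

from scipy import stats

# stats.f.ppf(q, dfn, dfd) returns the q-quantile of an F distribution with
# dfn and dfd degrees of freedom; cdf should invert it.
q = stats.f.ppf(0.975, 16, 18)
print(q, stats.f.cdf(q, 16, 18))  # the second number should print as 0.975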

The function that I managed to write gives results that are fairly close to the results given by the website, but not quite there (probably because the population size is ignored). What do I need to do to fix this and get the correct confidence intervals? For example: according to the site, if the population size is 80, the sample size is 16, and you find 8 successes, the 95% confidence interval is (27.5, 72.5). However, my function gives (24.651011149057535, 75.348988850942462). Fairly close, but not quite. I've included two other examples in the code.

Here's my code, including my (perhaps silly) brute-force attempt to find the correct F function to use. [I'm neither a professional programmer nor a professional statistician... I tried my best, but it's hard.]

from scipy import stats

def hypergeom_ci(population_size, sample_size, successes, p, F=stats.f.ppf):
    # Clopper-Pearson interval via F quantiles, transcribed from Wikipedia.
    # Note that population_size never enters the formula.
    n = sample_size
    x = successes
    alpha = 1 - p
    lower = (1 + (n - x + 1) / (x * F(alpha / 2, 2 * x, 2 * (n - x + 1)))) ** (-1)
    upper = (1 + (n - x) / ((x + 1) * F(1 - alpha / 2, 2 * (x + 1), 2 * (n - x)))) ** (-1)
    return 100 * lower, 100 * upper

population_size, sample_size, successes, p = 80, 16, 8, 0.95

# Brute force: try every attribute of stats.f as the mystery F function and print the results
for pk in dir(stats.f):
    try:
        F = getattr(stats.f, pk)
        print(pk, hypergeom_ci(population_size, sample_size, successes, p, F), sep=' -> ')
    except Exception:
        pass

# stats.f.ppf gives the closest and most reasonable results, so it is the default above

attempts = [(population_size, sample_size, successes, p), (1000000, 70, 8, 0.95), (8883, 70, 8, 0.92)]

# Answers given by the calculator at http://www.cluster-text.com/confidence_interval.php
h1 = (27.5, 72.5)
h2 = (5.0654, 21.2824)
h3 = (5.583699200720477, 20.17336485421592)
sitegives = [h1, h2, h3]

for trial, answer in zip(attempts, sitegives):
    print(trial, hypergeom_ci(*trial), answer)
Comment (Stefan Zobel): Well, isn't this difference simply due to the fact that the Clopper-Pearson interval on the Wikipedia page is for the binomial distribution (sampling with replacement) rather than the hypergeometric distribution (sampling without replacement)? That also explains why the population size doesn't enter. Also, looking at the JavaScript source code of that site, it doesn't seem to be even remotely related to what the Wikipedia section is about.
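As a quick illustration of that point (my own check, not part of the original comment; K = 40 is an arbitrary hypothetical number of successes in the population), the with-replacement and without-replacement tail probabilities do differ for the question's first example:

from scipy import stats

# Compare P(X <= 8) under the two sampling models for N = 80, n = 16
N, K, n, x = 80, 40, 16, 8
print(stats.binom.cdf(x, n, K / N))      # sampling with replacement (binomial)
print(stats.hypergeom.cdf(x, N, K, n))   # sampling without replacement (hypergeometric)
# The hypergeometric distribution is more concentrated around its mean, so the
# two probabilities are not the same and the intervals cannot agree exactly.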

1 Answer

0 votes

This isn't a direct answer, but I've made a package that calculates confidence intervals for the hypergeometric distribution. It's still a bit buggy, so please use it with caution.

Github: https://github.com/KeisukeNagakawa/cisim/blob/master/README.rst

PyPI: https://pypi.org/project/cisim/

I'd also like an exact solution like the one you've described.
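For what it's worth, here is a minimal sketch of one way to get such an exact interval by inverting two one-sided hypergeometric tests with scipy.stats.hypergeom. This is my own rough sketch, not the code used by cisim or by the website, and hypergeom_exact_ci is a name I made up for illustration:

from scipy.stats import hypergeom

def hypergeom_exact_ci(population_size, sample_size, successes, p=0.95):
    # Clopper-Pearson-style interval for the number of successes K in the
    # population: keep every K whose two one-sided exact p-values exceed alpha/2.
    N, n, x = population_size, sample_size, successes
    alpha = 1 - p
    candidates = [K for K in range(x, N - n + x + 1)
                  if hypergeom.sf(x - 1, N, K, n) > alpha / 2   # P(X >= x | K)
                  and hypergeom.cdf(x, N, K, n) > alpha / 2]    # P(X <= x | K)
    lo, hi = min(candidates), max(candidates)
    return 100 * lo / N, 100 * hi / N   # percentages, like the website's output

print(hypergeom_exact_ci(80, 16, 8))   # the question's first example

A bisection search would be faster than scanning every candidate K, but the linear scan keeps the idea obvious.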