I found this website: http://www.cluster-text.com/confidence_interval.php that calculates a confidence interval for a hypergeometric distribution, which linked to Wikipedia's Clopper-Pearson interval section here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval
I'm now wondering how to do the same calculation in Python. I assume the two large expressions below the "sentence" "The beta distribution is, in turn, related to the F-distribution so a third formulation of the Clopper-Pearson interval can be written using F quantiles:" are the lower and upper boundaries. For some reason, the formula doesn't seem to care about the population size at all...
The formula presented in Wikipedia calls an F function, which I thought is some of the functions in scipy.stats.f. I wrote a program to try all of them and I found that using stats.f.ppf as the "mysterious" F function gives the best results.
The function that I managed to write gives results that are fairly close to the results given by the website, but not quite there (probably because the population size is ignored). What do I need to do to fix this and get the correct confidence intervals? For example: according to the site, if the population size is 80, the sample size is 16 and you find 8 successes, the 95% confidence interval is (27.5, 72.5). However, my function gives (24.651011149057535, 75.348988850942462). Fairly close, but not quite. I've included two other examples in the code.
Here's my code. I've included the (perhaps silly) brute-force attempt to find the correct F function to use. [I'm not a professional programmer nor a professional statistician... I tried my best, but it's hard]
from scipy import stats
def hypergeom_ci(population_size, sample_size, successes, p):
n = sample_size
x = successes
alpha = 1-p
lower = (1+(n-x+1)/(x*F(alpha/2, 2*x, 2*(n-x+1))))**(-1)
upper = (1+(n-x)/((x+1)*F(1-alpha/2, 2*(x+1), 2*(n-x))))**(-1)
return 100 * lower, 100 * upper
population_size, sample_size, successes, p = 80, 16, 8, 0.95
for pk in dir(stats.f): # Try all functions in stats.f and print results
try:
F = eval('stats.f.'+pk)
print(pk, hypergeom_ci(population_size, sample_size, successes, p), sep=' -> ')
except:
pass
pk = 'ppf' # I found that stats.f.ppf gives the closest and most reasonable results
F = eval('stats.f.'+pk)
attempts = [(population_size, sample_size, successes, p), (1000000, 70, 8, 0.95), (8883, 70, 8, 0.92)]
h3 = (5.583699200720477, 20.17336485421592)
h1 = (27.5, 72.5)
h2 = (5.0654, 21.2824)
sitegives = [h1, h2, h3] # Answers given by the calculator at http://www.cluster-text.com/confidence_interval.php
for trial, answer in zip(attempts, sitegives):
print(trial, hypergeom_ci(*trial), answer)
binomial
distribution (sampling with replacement) rather than thehypergeometric
distribution (sampling without replacement)? That also explains why the population size doesn't enter. Also, looking at the Javascript sourcecode of that site, it doesn't seem to be even remotely related to what the wikipedia section is about. – Stefan Zobel