44
votes

I have computed a test statistic that is distributed as a chi square with 1 degree of freedom, and want to find out what P-value this corresponds to using python.

I'm a python and maths/stats newbie so I think what I want here is the probability denisty function for the chi2 distribution from SciPy. However, when I use this like so:

from scipy import stats
stats.chi2.pdf(3.84 , 1)
0.029846

However some googling and talking to some colleagues who know maths but not python have said it should be 0.05.

Any ideas? Cheers, Davy

7
If you run the test using scipy.stats.chisquare, do you get the desired result?Fred Foo
Btw., when I compute the PDF according to the Wikipedia I get the same result as SciPy: x = 3.84; reciprocal(2**.5 * gamma(.5)) * x ** (.5 - 1) * exp(- x / 2)Fred Foo
I think you're using the wrong function . . . as @larsmans mentions, you should probably use the chisquare function, but make sure to pass it array of the actual and expected, and it'll return to you both the 3.84 and the p-valu you're looking for.ernie
Ah, so how about pval = 1 - stats.chi2.cdf(3.84, 1) (saw this over in this thread.ernie
or even better >>> stats.chi2.sf(3.84, 1) 0.050043521248705106 for some gains in precision in the tailJosef

7 Answers

59
votes

Quick refresher here:

Probability Density Function: think of it as a point value; how dense is the probability at a given point?

Cumulative Distribution Function: this is the mass of probability of the function up to a given point; what percentage of the distribution lies on one side of this point?

In your case, you took the PDF, for which you got the correct answer. If you try 1 - CDF:

>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147

PDF CDF

27
votes

To calculate probability of null hypothesis given chisquared sum, and degrees of freedom you can also call chisqprob:

>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189

Notice:

chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead

22
votes

Update: as noted, chisqprob() is deprecated for scipy version 0.17.0 onwards. High accuracy chi-square values can now be obtained via scipy.stats.distributions.chi2.sf(), for example:

>>>from scipy.stats.distributions import chi2
>>>chi2.sf(3.84,1)
0.050043521248705189
>>>chi2.sf(1424,1)
1.2799986253099803e-311

While stats.chisqprob() and 1-stats.chi2.cdf() appear comparable for small chi-square values, for large chi-square values the former is preferable. The latter cannot provide a p-value smaller than machine epsilon,and will give very inaccurate answers close to machine epsilon. As shown by others, comparable values result for small chi-squared values with the two methods:

>>>from scipy.stats import chisqprob, chi2
>>>chisqprob(3.84,1)
0.050043521248705189
>>>1 - chi2.cdf(3.84,1)
0.050043521248705147

Using 1-chi2.cdf() breaks down here:

>>>1 - chi2.cdf(67,1)
2.2204460492503131e-16
>>>1 - chi2.cdf(68,1)
1.1102230246251565e-16
>>>1 - chi2.cdf(69,1)
1.1102230246251565e-16
>>>1 - chi2.cdf(70,1)
0.0

Whereas chisqprob() gives you accurate results for a much larger range of chi-square values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:

>>>chisqprob(67,1)
2.7150713219425247e-16
>>>chisqprob(68,1)
1.6349553217245471e-16
>>>chisqprob(69,1)
9.8463440314253303e-17    
>>>chisqprob(70,1)
5.9304458500824782e-17
>>>chisqprob(500,1)
9.505397766554137e-111
>>>chisqprob(1000,1)
1.7958327848007363e-219
>>>chisqprob(1424,1)
1.2799986253099803e-311
>>>chisqprob(1425,1)
0.0
7
votes

You meant to do:

>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
5
votes

Some of the other solutions are deprecated. Use scipy.stats.chi2 Survival Function. Which is the same as 1 - cdf(chi_statistic, df)

Example:

from scipy.stats import chi2
p_value = chi2.sf(chi_statistic, df)
3
votes

If you want to understand the math, the p-value of a sample, x (fixed), is

P[P(X) <= P(x)] = P[m(X) >= m(x)] = 1 - G(m(x)^2)

where,

  • P is the probability of a (say k-variate) normal distribution w/ known covariance (cov) and mean,
  • X is a random variable from that normal distribution,
  • m(x) is the mahalanobis distance = sqrt( < cov^{-1} (x-mean), x-mean >. Note that in 1-d this is just the absolute value of the z-score.
  • G is the CDF of the chi^2 distribution w/ k degrees of freedom.

So if you're computing the p-value of a fixed observation, x, then you compute m(x) (generalized z-score), and 1-G(m(x)^2).

for example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the p-value is about .046 (see a z-score table)

In [7]: from scipy.stats import chi2

In [8]: k = 1

In [9]: z = 2

In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528
1
votes

For ultra-high precision, when scipy's chi2.sf() isn't enough, bring out the big guns:

>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126