2
votes

I have two arrays on which I would like to run Pearson's chi-square test (goodness of fit). I want to test whether there is a significant difference between the expected and observed results.

observed = [11294, 11830, 10820, 12875]
expected = [10749, 10940, 10271, 11937]

I want to compare 11294 with 10749, 11830 with 10940, 10820 with 10271, etc.

Here's what I have:

>>> from scipy.stats import chisquare
>>> chisquare(f_obs=[11294, 11830, 10820, 12875],f_exp=[10749, 10940, 10271, 11937])
(203.08897607453906, 9.0718379533890424e-44)

where 203 is the chi-square test statistic and 9.07e-44 is the p-value. I'm confused by the result: since the p-value (9.07e-44) is below 0.05, we reject the null hypothesis and conclude that there is a significant difference between the observed and expected results. That doesn't seem right to me, because the numbers are so close. How do I fix this?

1
I get a low value from p-value tables and an identical answer from Mathematica as well. There's nothing wrong with the answer you're getting. - Asad Saeeduddin
I think this is the same problem as this question (just asked differently): use chi2_contingency instead. - user707650
@Evert: That is not the same question. In a contingency table, all the given frequencies are observed frequencies, and the expected frequencies are inferred from the observed frequencies. In the question here, both the observed and expected frequencies are given, so it is not a contingency table problem. - Warren Weckesser

1 Answer

3
votes

In general, the null hypothesis (H0) of a test of independence says that the two variables (X and Y) are independent, i.e. changing the values of X would not affect the values of Y.

For example, X = [1,2,3,4] and Y = [2,4,6,8]

If you calculate the p-value for this case with a test of association, it should come out very small, meaning that data like these would be very unlikely if the null hypothesis were true, i.e. if X and Y were independent of each other.

So we reject the null hypothesis here: the two variables depend on each other, in the form Y = 2X.

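In your case, the null hypothesis of the goodness-of-fit test is that the observed counts follow the expected counts. A p-value of 9.0718379533890424e-44 means that a discrepancy this large would be extremely unlikely if that null hypothesis were true, so the observed and expected frequencies differ significantly. The numbers look close, but with counts this large even differences of about 5% to 8% per category are far bigger than chance alone would produce.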
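To see where the 203 comes from, here is a minimal sketch that recomputes the statistic by hand using only the standard library. The helper `chi2_sf_df3` is my own illustrative function, not part of SciPy; it uses the closed-form upper-tail probability of the chi-square distribution, which is valid only for 3 degrees of freedom (4 categories minus 1).

```python
import math

observed = [11294, 11830, 10820, 12875]
expected = [10749, 10940, 10271, 11937]

# Each category contributes (O - E)^2 / E to the chi-square statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(contributions)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees
    of freedom (hypothetical helper; closed form valid for df = 3 only)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(statistic)

for o, e, c in zip(observed, expected, contributions):
    print(f"O={o:6d}  E={e:6d}  (O-E)^2/E = {c:7.3f}")
print(f"statistic = {statistic:.3f}, p-value = {p_value:.3e}")
```

Every category contributes well over 20 on its own, while the 5% critical value for 3 degrees of freedom is only about 7.8, which is why the p-value is so small.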

P.S. Your reading of the p-value is correct: the result is statistically significant.
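One more thing worth checking: the two arrays do not sum to the same total (46819 observed versus 43897 expected), and the goodness-of-fit test assumes they do. As far as I know, recent SciPy releases raise a ValueError from `chisquare` when the totals differ by more than a small relative tolerance. If your expected values are really meant as relative proportions rather than absolute counts (that is an assumption on my part, only you can tell), a rough sketch of rescaling them to the observed total looks like this. `chi2_sf_df3` is again a hypothetical helper valid only for 3 degrees of freedom.

```python
import math

observed = [11294, 11830, 10820, 12875]
expected = [10749, 10940, 10271, 11937]

total_obs = sum(observed)  # 46819
total_exp = sum(expected)  # 43897

# Assumption: the expected values describe proportions, so rescale them
# to have the same total as the observed counts.
scale = total_obs / total_exp
rescaled = [e * scale for e in expected]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, rescaled))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees
    of freedom (hypothetical helper; closed form valid for df = 3 only)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(statistic)
print(f"totals: observed={total_obs}, expected={total_exp}")
print(f"rescaled statistic = {statistic:.3f}, p-value = {p_value:.4f}")
```

With the totals matched, the test only compares how the counts are distributed across the four categories, not the overall volume, so the statistic is much smaller than 203. Whether that is the comparison you want depends on what the expected numbers represent.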