I'm trying to automate chi-squared calculations. I'm using scipy.stats.pearsonr, but it gives me different answers than SPSS does, off by roughly a factor of 10 (e.g. .07 vs. .8).

I'm pretty sure that the data is the same in both cases because I'm printing out the crosstab in both cases (using pandas.crosstab) and the numbers are identical.

d1 = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]

d2 = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1]

import scipy.stats
print(scipy.stats.pearsonr(d1, d2))

gives:

 (-0.065191159985573108, 0.61172152831874682)

(the 1st is the coefficient, the 2nd is the p value)

However, SPSS says that the Pearson Chi-Square p-value is .057.

Is there something I should check other than the crosstab?

Could you also show the corresponding SPSS code? - Warren Weckesser
Someone else made the SPSS, so I only have access to the output easily... - Brian Postow

1 Answer

Apparently you want the chi-squared statistic and p-value for the contingency table (i.e. the "cross tab") of the data. The SciPy function pearsonr is not the correct function for this: it computes the Pearson correlation coefficient between two sequences, which is a different quantity from Pearson's chi-squared test of independence. To do the calculation with SciPy, you'll need to form the contingency table and then use scipy.stats.chi2_contingency.
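
To make the difference concrete, here is a minimal sketch in plain Python (no SciPy) of the quantity pearsonr reports: the linear correlation between the numeric codes in d1 and d2. The counts dictionary is copied from the crosstab of this data (d1 as rows, d2 as columns), and the helper name pearson_r is made up for this illustration.

```python
import math

# Cell counts from the crosstab of d1 (rows) against d2 (columns).
counts = {(0, 0): 5, (0, 1): 7, (0, 2): 4,
          (1, 0): 10, (1, 1): 34, (1, 2): 3}

# Expand the counts back into (d1, d2) pairs; order does not affect r.
pairs = [pair for pair, n in counts.items() for _ in range(n)]


def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(pairs)
print(round(r, 4))  # -0.0652, the coefficient pearsonr reported
```

This treats 0, 1 and 2 as quantities on a scale, whereas the chi-squared test treats them only as category labels, so the two p-values have no reason to agree.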

There are several ways you could convert d1 and d2 into a contingency table. Here I'll use the Pandas function pandas.crosstab. Then I'll use chi2_contingency for the chi-squared test.

First, here is your data. I have it in NumPy arrays, but this is not necessary:

In [49]: d1
Out[49]: 
array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1])

In [50]: d2
Out[50]: 
array([1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1])

Use pandas to form the contingency table:

In [51]: import pandas as pd

In [52]: table = pd.crosstab(d1, d2)

In [53]: table
Out[53]: 
col_0   0   1  2
row_0           
0       5   7  4
1      10  34  3
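
pandas is not required for this step. As a rough sketch of one alternative, the same counts can be tallied with collections.Counter from the standard library. The names counts, rows, cols and table_list are made up for this example.

```python
from collections import Counter

d1 = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
      1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
d2 = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
      1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
      1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1]

# Count how often each (d1, d2) pair occurs.
counts = Counter(zip(d1, d2))

# Sorted distinct values define the row and column order.
rows = sorted(set(d1))
cols = sorted(set(d2))

# Nested list of counts; Counter returns 0 for pairs that never occur.
table_list = [[counts[(r, c)] for c in cols] for r in rows]
print(table_list)  # [[5, 7, 4], [10, 34, 3]]
```

A nested list like this should work as input to chi2_contingency as well, since that function accepts any array-like of observed counts.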

Then use chi2_contingency for the chi-squared test:

In [54]: from scipy.stats import chi2_contingency

In [55]: chi2, p, dof, expected = chi2_contingency(table.values)

In [56]: p
Out[56]: 0.057230732412525138

The p-value (about .057) matches the value reported by SPSS.
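
As a cross-check on that result, the statistic can be worked out by hand. This is only a sketch under two assumptions. First, no continuity correction is applied: chi2_contingency applies Yates' correction by default only when there is one degree of freedom, and this table has two. Second, the closed form exp(-x/2) for the upper tail of the chi-squared distribution holds only for exactly two degrees of freedom. The variable names are illustrative.

```python
import math

# Observed counts from the contingency table (rows: d1, columns: d2).
observed = [[5, 7, 4],
            [10, 34, 3]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2_stat = sum((o - e) ** 2 / e
                for obs_row, exp_row in zip(observed, expected)
                for o, e in zip(obs_row, exp_row))

dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# For dof == 2 only, the chi-squared survival function is exp(-x / 2).
assert dof == 2
p_value = math.exp(-chi2_stat / 2)

print(round(chi2_stat, 2), round(p_value, 3))  # 5.72 0.057
```

For a 2x2 table the default Yates correction would apply, so in that case it may be necessary to pass correction=False to chi2_contingency to reproduce an uncorrected Pearson chi-squared value.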


Update: As of SciPy 1.7.0 (released in mid-2021), you can also create the contingency table with scipy.stats.contingency.crosstab:

In [33]: from scipy.stats.contingency import crosstab  # Requires SciPy 1.7.0 or later

In [34]: d1
Out[34]: 
array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1])

In [35]: d2
Out[35]: 
array([1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1])

In [36]: (vals1, vals2), table = crosstab(d1, d2)

In [37]: vals1
Out[37]: array([0, 1])

In [38]: vals2
Out[38]: array([0, 1, 2])

In [39]: table
Out[39]: 
array([[ 5,  7,  4],
       [10, 34,  3]])