Why do np.corrcoef(x) and df.corr() give different results?

Question

Why are the NumPy and pandas correlation coefficient matrices different when I use np.corrcoef(x) and df.corr()?

import numpy as np
import pandas as pd

x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)
print("matrix:")
print(x)
print()
print("df:")
print(x_df)
print()

print("np correlation matrix: ")
print(np.corrcoef(x))
print()
print("pd correlation matrix: ")

print(x_df.corr())
print()

This gives me the output:

matrix:
[[ 0  1  2]
 [ 2  1  0]
 [ 7  9 13]]

df:
   0  1   2
0  0  1   2
1  2  1   0
2  7  9  13

np correlation matrix: 
[[ 1.         -1.          0.98198051]
 [-1.          1.         -0.98198051]
 [ 0.98198051 -0.98198051  1.        ]]

pd correlation matrix: 
          0         1         2
0  1.000000  0.960769  0.911293
1  0.960769  1.000000  0.989743
2  0.911293  0.989743  1.000000

I'm guessing they are different types of correlation coefficients?

np.corrcoef(x.T) gives the same matrix as x_df.corr(), and so does np.corrcoef(x, rowvar=False). (comment by Alex Alex)

Paul Brennan · Accepted Answer · 2021-01-28T01:14:14

@AlexAlex is right: the two calls are correlating different sets of numbers.

Think about it with a 2x3 matrix:

yx = np.array([[0, 2, 7], [1, 1, 9]])
np.corrcoef(yx)

gives

array([[1.        , 0.96076892],
       [0.96076892, 1.        ]])

and

x_df = pd.DataFrame(yx.T)
print(x_df)
x_df[0].corr(x_df[1])

gives

   0  1
0  0  1
1  2  1
2  7  9

0.9607689228305227

Here the value 0.9607... matches the off-diagonal entry from the NumPy calculation.

By default, np.corrcoef treats each row as a variable (rowvar=True), whereas df.corr() computes correlations between columns. Your NumPy call therefore correlates the rows of x rather than its columns. To make NumPy match pandas, transpose the input with .T or pass rowvar=False.
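As a quick check, here is a minimal sketch that reuses the x and x_df from the question and compares the results numerically. The variable names by_columns, by_rows and pandas_corr are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question: a 3x3 array and a DataFrame built from it.
x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)

# pandas correlates columns, so convert its result to a plain array for comparison.
pandas_corr = x_df.corr().to_numpy()

# Tell NumPy that the variables are in the columns (or, equivalently, transpose).
by_columns = np.corrcoef(x, rowvar=False)
by_columns_t = np.corrcoef(x.T)

# The default NumPy call correlates rows; pandas gives the same if the frame is transposed.
by_rows = np.corrcoef(x)
pandas_rows = x_df.T.corr().to_numpy()

print(np.allclose(by_columns, pandas_corr))    # True
print(np.allclose(by_columns_t, pandas_corr))  # True
print(np.allclose(by_rows, pandas_rows))       # True
```

Using np.allclose rather than == avoids spurious mismatches from floating-point rounding between the two libraries.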
