3
votes

Why are the NumPy correlation coefficient matrix and the pandas correlation coefficient matrix different when using np.corrcoef(x) and df.corr()?

import numpy as np
import pandas as pd

x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)
print("matrix:")
print(x)
print()
print("df:")
print(x_df)
print()

print("np correlation matrix: ")
print(np.corrcoef(x))
print()
print("pd correlation matrix: ")

print(x_df.corr())
print()

This gives me the output:

matrix:
[[ 0  1  2]
 [ 2  1  0]
 [ 7  9 13]]

df:
   0  1   2
0  0  1   2
1  2  1   0
2  7  9  13

np correlation matrix: 
[[ 1.         -1.          0.98198051]
 [-1.          1.         -0.98198051]
 [ 0.98198051 -0.98198051  1.        ]]

pd correlation matrix: 
          0         1         2
0  1.000000  0.960769  0.911293
1  0.960769  1.000000  0.989743
2  0.911293  0.989743  1.000000

I'm guessing they are different types of correlation coefficients?

1
np.corrcoef(x.T)==x_df.corr() or print(np.corrcoef(x, rowvar=False)) - Alex Alex

1 Answer

2
votes

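@AlexAlex is right: the two calls correlate different sets of numbers. Both compute the Pearson coefficient by default, but np.corrcoef treats each row as a variable, while df.corr() treats each column as a variable.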

Think about it with a 2x3 matrix:

yx = np.array([[0, 2, 7], [1, 1, 9]])
np.corrcoef(yx)

gives

array([[1.        , 0.96076892],
       [0.96076892, 1.        ]])

and

x_df = pd.DataFrame(yx.T)
print(x_df)
x_df[0].corr(x_df[1])

gives

   0  1
0  0  1
1  2  1
2  7  9

0.9607689228305227

where the value 0.9607... matches the off-diagonal entry of the NumPy result.

The call in your question computes the correlations between the rows rather than between the columns. You can fix the NumPy version by passing x.T or by using the argument rowvar=False.
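
As a quick check, here is a minimal sketch using the data from the question. It compares the default np.corrcoef result and the rowvar=False result against df.corr(). The variable names by_rows, by_cols and pandas_corr are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question: after .T, each column is one variable.
x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)

# Default: np.corrcoef treats each ROW as a variable (rowvar=True).
by_rows = np.corrcoef(x)

# rowvar=False makes np.corrcoef treat each COLUMN as a variable,
# which is what DataFrame.corr() does.
by_cols = np.corrcoef(x, rowvar=False)

pandas_corr = x_df.corr().to_numpy()

print(np.allclose(by_cols, pandas_corr))           # True
print(np.allclose(np.corrcoef(x.T), pandas_corr))  # True
print(np.allclose(by_rows, pandas_corr))           # False
```

So the two libraries agree once they are pointed at the same axis. The coefficient is the same in both cases; only the orientation differs.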