I have a 4-by-3 matrix, X, and wish to form the 3-by-3 Pearson correlation matrix, C, obtained by computing correlations between all 3 possible pairs of columns of X. However, entries of C that correspond to correlations that aren't statistically significant should be set to zero.
I know how to get pair-wise correlations and significance values using pearsonr in scipy.stats. For example,
import numpy as np
from scipy.stats import pearsonr
X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
pearsonr(X[:, 0], X[:, 1])
returns (0.9915008164289165, 0.00849918357108348), a correlation of about .9915 between columns one and two of X, with p-value .0085.
I could easily get my desired matrix using nested loops:
- Pre-populate C as a 3-by-3 matrix of zeros.
- Each pass of the nested loop will correspond to two columns of X. The entry of C corresponding to this pair of columns will be set to the pairwise correlation provided the p-value is less than or equal to my threshold, say .01.
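For concreteness, the nested-loop approach described above might look like the following sketch. The names alpha and n_cols are mine, and putting 1 on the diagonal is an assumption (a column is taken to be perfectly correlated with itself) rather than something the steps above spell out.

```python
import numpy as np
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
alpha = 0.01  # significance threshold (illustrative name)

n_cols = X.shape[1]
C = np.zeros((n_cols, n_cols))  # pre-populate C with zeros
np.fill_diagonal(C, 1.0)        # assumption: keep self-correlations at 1

# Visit each unordered pair of columns once and fill both symmetric entries.
for i in range(n_cols):
    for j in range(i + 1, n_cols):
        r, p = pearsonr(X[:, i], X[:, j])
        if p <= alpha:
            C[i, j] = C[j, i] = r

print(C)
```

With the X above, only the entry for columns one and two survives the threshold; the other two off-diagonal correlations stay at zero.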
I'm wondering if there's a simpler way. I know in Pandas, I can create the correlation matrix, C, in basically one line:
import pandas as pd
df = pd.DataFrame(data=X)
C_frame = df.corr(method='pearson')
C = C_frame.to_numpy()
Is there a way to get the matrix or data frame of p-values, P, without a loop? If so, how could I set each entry of C to zero should the corresponding p-value in P exceed my threshold?
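To make the goal concrete, here is a rough sketch of one loop-free route. It assumes the usual two-sided test for Pearson's r, in which t = r * sqrt((n - 2) / (1 - r^2)) follows a Student t distribution with n - 2 degrees of freedom; as far as I can tell this gives the same p-values as pearsonr. The names alpha, dof, and C_masked are illustrative.

```python
import numpy as np
from scipy import stats

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
alpha = 0.01          # significance threshold (illustrative name)
dof = X.shape[0] - 2  # degrees of freedom for the t-test on r

# Full correlation matrix between columns, computed in one call.
C = np.corrcoef(X, rowvar=False)

# Element-wise t statistics; the diagonal divides by zero, so silence
# those warnings and overwrite the diagonal p-values afterwards.
with np.errstate(divide="ignore", invalid="ignore"):
    t = C * np.sqrt(dof / (1.0 - C**2))
P = 2.0 * stats.t.sf(np.abs(t), dof)  # two-sided p-values
np.fill_diagonal(P, 0.0)              # assumption: self-correlations always kept

# Zero out every correlation whose p-value exceeds the threshold.
C_masked = np.where(P <= alpha, C, 0.0)

print(P)
print(C_masked)
```

The masking step is just np.where on the two matrices, so the same line would work with a P obtained any other way.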
C_frame.where(C_frame>0.99)? – Quang Hoang
… method argument to return the p-values instead of the correlation coefficients. You could use that to mask your df.corr() result. Though it's still a loop... – ALollz
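A sketch of how the last comment could be applied, assuming it refers to passing a callable as the method argument of df.corr. pandas documents that the result of a callable has 1 on the diagonal whatever the callable returns, so the sketch subtracts an identity matrix to put p-values of 0 there. The names alpha, P_frame, and C_masked are illustrative, and pandas still loops over column pairs internally.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
alpha = 0.01  # significance threshold (illustrative name)
df = pd.DataFrame(data=X)

# Correlation coefficients, as in the question.
C_frame = df.corr(method="pearson")

# Same call, but the callable returns the p-value for each column pair.
P_frame = df.corr(method=lambda a, b: pearsonr(a, b)[1])
# pandas forces the diagonal to 1; subtract the identity to make it 0.
P_frame = P_frame - np.eye(len(df.columns))

# Keep a correlation only where its p-value is within the threshold.
C_masked = C_frame.where(P_frame <= alpha, 0.0)

print(C_masked)
```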