How can I compute the Pearson correlation matrix and retain only significant values?

Question

I have a 4-by-3 matrix, X, and wish to form the 3-by-3 Pearson correlation matrix, C, obtained by computing correlations between all 3 possible column combinations of X. However, entries of C that correspond to correlations that aren't statistically significant should be set to zero.

I know how to get pair-wise correlations and significance values using pearsonr in scipy.stats. For example,

import numpy as np
from scipy.stats.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
pearsonr(X[:, 0], X[:, 1])

returns (0.9915008164289165, 0.00849918357108348), a correlation of about .9915 between columns one and two of X, with p-value .0085.

I could easily get my desired matrix using nested loops:

Pre-populate C as a 3-by-3 matrix of zeros.
Each pass of the nested loop will correspond to two columns of X. The entry of C corresponding to this pair of columns will be set to the pairwise correlation provided the p-value is less than or equal to my threshold, say .01.

I'm wondering if there's a simpler way. I know in Pandas, I can create the correlation matrix, C, in basically one line:

import pandas as pd

df = pd.DataFrame(data=X)
C_frame = df.corr(method='pearson') 
C = C_frame.to_numpy()

Is there a way to get the matrix or data frame of p-values, P, without a loop? If so, how could I set each entry of C to zero should the corresponding p-value in P exceed my threshold?

stackoverflow.com/questions/52741236/… is relevant. The highly voted answer shows how to use the method argument to return the p-values instead of the correlation coefficients. You could use that to mask your df.corr() result. Though it's still a loop... — ALollz

Mad Physicist Mad Physicist · Accepted Answer · 2019-12-19T20:37:27

Looking through the docs for pearsonr reveals the fomulae used to compute the correlations. It should not be too difficult to get the correlations between each column of a matrix using vectorization.

While you could compute the value of C using pandas, I will show pure numpyan implementation for the entire process.

First, compute the r-values:

X = np.array([[1,  1, -2],
              [0,  0,  0],
              [0, .2,  1],
              [5,  3,  4]])
n = X.shape[0]

X -= X.mean(axis=0)
s = (X**2).sum(axis=0)
r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])

Computing the p values is made simple given the existence of the beta distribution in scipy. Taken directly from the docs:

dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))

You can trivially make a mask from p with your threshold, and apply it to r to make C:

mask = (p <= 0.01)
C = np.zeros_like(r)
C[mask] = r[mask]

A better option would probably be to modify your r in-place:

r[p > 0.1] = 0

In function form:

def non_trivial_correlation(X, threshold=0.1):
    n = X.shape[0]
    X = X - X.mean(axis=0) # Don't modify the original
    x = (X**2).sum(axis=0)
    r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
    p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-abs(r))
    r[p > threshold] = 0
    return r

How can I compute the Pearson correlation matrix and retain only significant values?

1 Answers