I'm hoping to generate new "fake" data from the data I already have with numpy.random.multivariate_normal.
With n samples and d features in an n x d pandas DataFrame:
means = data.mean(axis=0)   # per-feature means, length d
covariances = data.cov()    # d x d covariance matrix
variances = data.var()      # per-feature variances, length d
means.shape, covariances.shape, variances.shape
>>> ((16349,), (16349, 16349), (16349,))
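The plan is then to draw new rows with something like this (n_fake is just a placeholder for however many samples I want):

n_fake = 1000  # number of synthetic rows to generate
fake = np.random.multivariate_normal(means, covariances, size=n_fake)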
This looks fine, but the covariance matrix covariances isn't positive semidefinite, which is a requirement of numpy.random.multivariate_normal.
x = np.linalg.eigvals(covariances)  # eigenvalues of the covariance matrix
np.all(x >= 0)  # positive semidefinite means every eigenvalue is >= 0
>>> False
len([y for y in x if y < 0]) # negative eigenvalues
>>> 4396
len([y for y in x if y > 0]) # positive eigenvalues
>>> 4585
len([y for y in x if y == 0]) # zero eigenvalues.
>>> 7368
However, Wikipedia says: "In addition, every covariance matrix is positive semi-definite."
This leads me to wonder whether pandas.DataFrame.cov actually returns a true covariance matrix. Here's the function's implementation; it seems to mostly defer to numpy.cov, which also promises a covariance matrix.
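For what it's worth, the two can be compared directly; a quick sanity check (assuming data has no missing values, since pandas handles NaNs pairwise and numpy.cov does not):

# both should produce the same d x d matrix when there are no NaNs
np.allclose(data.cov().values, np.cov(data.values, rowvar=False))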
Can someone clear this up for me? Why is the output of pandas.DataFrame.cov() not positive semidefinite?
Updated question:
From the first answer, it seems like all the negative eigenvalues are tiny. The author of that answer suggests clipping these eigenvalues, but it's still unclear to me how to sensibly generate a proper covariance matrix with this information.
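For context, here's how the scale of those negative eigenvalues can be checked against the largest one (x is the eigenvalue array from above):

# how negative is "negative"? compare the extremes
print(x.min(), x.max(), x.min() / x.max())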
I can imagine using pd.DataFrame.cov(), doing an eigendecomposition to get eigenvectors and eigenvalues, clipping the negative eigenvalues to zero, and then multiplying those matrices back together to get a new covariance matrix, but that seems quite precarious. Is that done in practice, or is there a better way? A sketch of what I mean is below.