
I'm hoping to generate new "fake" data from the data I already have with numpy.random.multivariate_normal.

With n samples and d features in an n x d pandas DataFrame:

means = data.mean(axis=0)
covariances = data.cov()
variances = data.var()
means.shape, covariances.shape, variances.shape

>>> ((16349,), (16349, 16349), (16349,))

This looks fine, but the covariance matrix covariances isn't positive semidefinite, which is a requirement of numpy.random.multivariate_normal.

x = np.linalg.eigvals(covariances)

np.all(x >= 0)
>>> False

len([y for y in x if y < 0])  # negative eigenvalues
>>> 4396

len([y for y in x if y > 0])  # positive eigenvalues
>>> 4585

len([y for y in x if y == 0])  # zero eigenvalues.
>>> 7368

However, Wikipedia says

In addition, every covariance matrix is positive semi-definite.

Which leads me to wonder whether pandas.DataFrame.cov gets you a real covariance matrix. Here's the function's implementation. It seems to mostly defer to numpy.cov which also seems to promise a covariance matrix.

Can someone clear this up for me? Why is the result of pandas.DataFrame.cov() not positive semidefinite?

Updated question:

From the first answer, it seems like all the negative eigenvalues are tiny. The author of that answer suggests clipping these eigenvalues, but it's still unclear to me how to sensibly generate a proper covariance matrix with this information.

I can imagine calling pd.DataFrame.cov(), doing an eigendecomposition to get the eigenvectors and eigenvalues, clipping the eigenvalues, and then multiplying the factors back together to get a new covariance matrix, but that seems quite precarious. Is that done in practice, or is there a better way?


1 Answer


Probably what's happening is that the result is positive-semidefinite, to within the accuracy of the computation. For example:

In [71]: np.linalg.eigvals(np.cov(np.random.random((5,5))))
Out[71]: 
array([  1.87557170e-01,   9.98250875e-02,   6.85211153e-02,
         1.01062281e-02,  -5.99164839e-18])

has a negative eigenvalue, but the magnitude is small.
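When there are no more observations than variables, as in the 5 x 5 example above, the sample covariance is rank-deficient, so at least one eigenvalue is exactly zero in theory and round-off can push it slightly negative. One way to check that a negative eigenvalue is only round-off is to compare it with the largest eigenvalue. This is a minimal sketch, assuming a symmetric input so that np.linalg.eigvalsh applies; the helper name min_eig_ratio and the 1e-10 threshold are illustrative choices, not part of NumPy or pandas:

```python
import numpy as np

def min_eig_ratio(cov):
    """Smallest eigenvalue divided by the largest (hypothetical helper)."""
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    # sorted in ascending order.
    eigs = np.linalg.eigvalsh(cov)
    return eigs[0] / eigs[-1]

rng = np.random.default_rng(0)
# 5 variables with 5 observations each: the covariance has rank at most 4.
cov = np.cov(rng.random((5, 5)))
ratio = min_eig_ratio(cov)

# Illustrative threshold; anything near machine precision is round-off.
is_roundoff = abs(ratio) < 1e-10
```

A ratio near machine precision suggests the matrix is positive semidefinite up to floating-point error, whereas a ratio of noticeable size would point to a real problem in how the matrix was built.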

So in your shoes I'd verify that the magnitude of the violations was small, and then clip to zero.
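The verify-then-clip step could look something like the sketch below. It assumes the input is symmetric up to round-off; the function name clip_to_psd and the rel_tol default are hypothetical, chosen only for illustration, and the small random data set stands in for the real DataFrame:

```python
import numpy as np

def clip_to_psd(cov, rel_tol=1e-8):
    """Clip tiny negative eigenvalues to zero and rebuild the matrix.

    Hypothetical helper: raises if a negative eigenvalue is too large
    to be explained by round-off.
    """
    cov = np.asarray(cov, dtype=float)
    sym = (cov + cov.T) / 2              # remove any asymmetry from round-off
    vals, vecs = np.linalg.eigh(sym)     # real eigenvalues, ascending order
    scale = np.abs(vals).max()
    if vals[0] < -rel_tol * scale:
        raise ValueError("negative eigenvalue is too large to be round-off")
    clipped = np.clip(vals, 0.0, None)
    # Rebuild V * diag(clipped) * V^T from the eigendecomposition.
    return (vecs * clipped) @ vecs.T

rng = np.random.default_rng(0)
data = rng.random((5, 20))               # 5 samples, 20 features (stand-in data)
cov = np.cov(data, rowvar=False)         # 20 x 20, rank-deficient
cov_psd = clip_to_psd(cov)

# Draw 3 "fake" samples using the cleaned covariance matrix.
fake = rng.multivariate_normal(data.mean(axis=0), cov_psd, size=3)
```

Because the clipped eigenvalues were already at round-off level, cov_psd differs from cov only in the last few digits. numpy.random.multivariate_normal also accepts check_valid and tol arguments, so depending on the size of the violations, adjusting those may be enough without rebuilding the matrix.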