4 votes

I have a data set consisting of ~200 99x20 arrays of frequencies, with each column summing to unity. I have plotted these as heatmaps. Each array is pretty sparse, with only about 1-7 of the 20 values at each of the 99 positions being nonzero.

However, I would like to cluster these samples in terms of how similar their frequency profiles are (minimum Euclidean distance or something like that). To do so, I have flattened each 99x20 array into a length-1980 vector and stacked them into a 200x1980 observation array.
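Roughly, the flattening step looks like this (a minimal sketch; arrays is just an illustrative name for a Python list holding the ~200 arrays):

import numpy as np

# arrays: list of ~200 arrays, each of shape (99, 20)
obs = np.vstack([a.reshape(-1) for a in arrays])
print(obs.shape)  # (200, 1980)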

Before finding the clusters, I have tried whitening the data using scipy.cluster.vq.whiten. whiten normalizes each column by its standard deviation, but due to the way I've flattened my data arrays, I have some (8) columns in which every frequency is zero, so the standard deviation is zero. The whitened array therefore contains infinite (or NaN) values and the centroid finding fails (or gives ~200 centroids).
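To illustrate the problem with plain NumPy (the exact handling of zero variance inside whiten may differ between SciPy versions, but the effect on my data is the same):

import numpy as np

obs = np.zeros((200, 1980))
obs[:, 0] = np.random.rand(200)     # one column with nonzero variance
std = obs.std(axis=0)               # 0.0 for every all-zero column
whitened = obs / std                # 0/0 -> nan, nonzero/0 -> inf
print(np.isfinite(whitened).all())  # False, so centroid finding breaks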

My question is, how should I go about resolving this? So far, I've tried

  • Not whitening the data. This causes k-means to give different centroids every time it is run (somewhat expected), despite increasing the iter keyword considerably (see the sketch after this list).
  • Transposing the arrays before I flatten them. The zero-variance columns just shift position.
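For reference, a minimal sketch of the k-means call I'm describing (k=5 is only a placeholder, and the seed keyword needs a reasonably recent SciPy; without it the initialisation is random on every run):

from scipy.cluster.vq import kmeans

# obs is the 200x1980 observation array; iter re-runs k-means and keeps
# the codebook with the lowest distortion; seed makes the runs reproducible
centroids, distortion = kmeans(obs, 5, iter=50, seed=1234)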

Is it ok to just delete some of these zero variance columns? Would this bias the clustering in any way?

EDIT: I have also tried using my own whiten function which just does

import numpy as np

for i in range(arr.shape[1]):
    # skip (near-)zero-variance columns instead of dividing by ~0
    if np.abs(arr[:, i].std()) < 1e-8: continue
    arr[:, i] /= arr[:, i].std()

This seems to work, but I'm not sure if this is biasing the clustering in any way.

Thanks

2
As a minor coding point, when checking for 0.0 floating-point values, don't use equality checks. if arr[:,i].std() == 0 should be if abs(arr[:,i].std()) < epsilon, where epsilon is a very small value like 0.0000001. Otherwise rounding errors can make a value that should be zero appear as non-zero. For the given problem it might always work fine, but in general the above is a better way of doing floating-point 'equality' checks. – Pyrce
@Pyrce Thanks, edited. – wflynny

2 Answers

3 votes

Removing the columns of all 0's should not bias the data. If you have N-dimensional data but one dimension has the same value for every observation, it is exactly the same as having (N-1)-dimensional data. This notion of effective dimensionality is called rank.

Consider 3-D data where all of your data points lie on the x=0 plane. Can you see how this is exactly the same as 2-D data?
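If it helps, here is a minimal sketch of dropping those columns before whitening, assuming obs is your 200x1980 observation array (the 1e-8 threshold follows the epsilon suggestion in the comments, and k=5 is a placeholder):

from scipy.cluster.vq import whiten, kmeans

keep = obs.std(axis=0) > 1e-8      # mask of non-constant columns
obs_reduced = obs[:, keep]         # same 200 rows, minus the all-zero columns
centroids, distortion = kmeans(whiten(obs_reduced), 5)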

3 votes

First of all, dropping constant columns is perfectly fine. Obviously they do not contribute any information, so there is no reason to keep them.

However, k-means is not particularly good for sparse vectors. The problem is that the resulting "centroids" will most likely be more similar to each other than to their cluster members. In sparse data, every object is to some extent an outlier, and k-means is quite sensitive to outliers because it tries to minimize the sum of squared distances.

I suggest that you do the following:

  1. Find a similarity measure that works for your domain. Spend quite a lot of time on this step: figuring out how to capture similarity for your particular use case.

  2. Once you have that similarity, compute the 200x200 similarity matrix. Since your data set is really tiny, you can afford expensive clustering methods, such as hierarchical clustering, that would not scale to thousands of objects. If you want, you could also try OPTICS or DBSCAN, but DBSCAN in particular only becomes really interesting when your data set is much larger. For tiny data sets, hierarchical clustering is fine (a rough sketch follows below).
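A rough sketch of that workflow with SciPy (Euclidean distance and a 5-cluster cut are only placeholders; substitute whatever similarity measure you settle on in step 1):

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# obs: the 200x1980 observation array
dists = pdist(obs, metric='euclidean')            # condensed pairwise distances
sim_matrix = squareform(dists)                    # full 200x200 matrix, if you want to inspect it
Z = linkage(dists, method='average')              # agglomerative clustering of the 200 samples
labels = fcluster(Z, t=5, criterion='maxclust')   # cut the dendrogram into 5 clusters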