4 votes

I have a data set consisting of ~200 99x20 arrays of frequencies, with each column summing to unity. I have plotted these as heatmaps. Each array is pretty sparse, with only about 1-7 of the 20 values at each of the 99 positions being nonzero.

However, I would like to cluster these samples in terms of how similar their frequency profiles are (minimum Euclidean distance or something like that). To do so, I have flattened each 99x20 array into a length-1980 vector and stacked them into a 200x1980 observation array.
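Roughly, the flattening step looks like this (a minimal sketch; arrays is just an illustrative name for a Python list holding the ~200 arrays):

import numpy as np

# arrays: list of ~200 arrays, each of shape (99, 20)
obs = np.vstack([a.reshape(-1) for a in arrays])
print(obs.shape)  # (200, 1980)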

Before finding the clusters, I have tried whitening the data using scipy.cluster.vq.whiten. whiten normalizes each column by its standard deviation, but due to the way I've flattened my data arrays, I have some (8) columns in which every frequency is zero, so the standard deviation is zero. The whitened array therefore contains infinite (or NaN) values and the centroid finding fails (or gives ~200 centroids).
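To illustrate the problem with plain NumPy (the exact handling of zero variance inside whiten may differ between SciPy versions, but the effect on my data is the same):

import numpy as np

obs = np.zeros((200, 1980))
obs[:, 0] = np.random.rand(200)     # one column with nonzero variance
std = obs.std(axis=0)               # 0.0 for every all-zero column
whitened = obs / std                # 0/0 -> nan, nonzero/0 -> inf
print(np.isfinite(whitened).all())  # False, so centroid finding breaks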

My question is, how should I go about resolving this? So far, I've tried

  • Not whitening the data. This causes k-means to give different centroids every time it is run (somewhat expected), despite increasing the iter keyword considerably (see the sketch after this list).
  • Transposing the arrays before I flatten them. The zero-variance columns just shift position.
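For reference, a minimal sketch of the k-means call I'm describing (k=5 is only a placeholder, and the seed keyword needs a reasonably recent SciPy; without it the initialisation is random on every run):

from scipy.cluster.vq import kmeans

# obs is the 200x1980 observation array; iter re-runs k-means and keeps
# the codebook with the lowest distortion; seed makes the runs reproducible
centroids, distortion = kmeans(obs, 5, iter=50, seed=1234)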

Is it ok to just delete some of these zero variance columns? Would this bias the clustering in any way?

EDIT: I have also tried using my own whiten function which just does

import numpy as np

for i in range(arr.shape[1]):
    # skip (near-)zero-variance columns instead of dividing by ~0
    if np.abs(arr[:, i].std()) < 1e-8: continue
    arr[:, i] /= arr[:, i].std()

This seems to work, but I'm not sure if this is biasing the clustering in any way.

Thanks

2
As a minor coding point, when checking for 0.0 floating-point values, don't use equality checks. if arr[:,i].std() == 0 should be if abs(arr[:,i].std()) < epsilon, where epsilon is a very small value like 0.0000001. Otherwise rounding errors can make a value that should be zero appear as non-zero. For the given problem it might always work fine, but in general the above is a better way of doing floating-point 'equality' checks. – Pyrce
@Pyrce Thanks, edited. – wflynny

2 Answers

3 votes

Removing the columns of all 0's should not bias the data. If you have N-dimensional data but one dimension has the same value for every observation, it is exactly the same as having (N-1)-dimensional data. This notion of effective dimensionality is called rank.

Consider 3-D data where all of your data points lie on the x=0 plane. Can you see how this is exactly the same as 2-D data?
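If it helps, here is a minimal sketch of dropping those columns before whitening, assuming obs is your 200x1980 observation array (the 1e-8 threshold follows the epsilon suggestion in the comments, and k=5 is a placeholder):

from scipy.cluster.vq import whiten, kmeans

keep = obs.std(axis=0) > 1e-8      # mask of non-constant columns
obs_reduced = obs[:, keep]         # same 200 rows, minus the all-zero columns
centroids, distortion = kmeans(whiten(obs_reduced), 5)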

3 votes

First of all, dropping constant columns is perfectly fine. Obviously they do not contribute any information, so there is no reason to keep them.

However, k-means is not particularly good for sparse vectors. The problem is that the resulting "centroids" will most likely be more similar to each other than to their cluster members. In sparse data, every object is to some extent an outlier, and k-means is quite sensitive to outliers because it tries to minimize the sum of squared distances.

I suggest that you do the following:

  1. Find a similarity measure that works for your domain. Spend quite a lot of time on this step: figuring out how to capture similarity for your particular use case.

  2. Once you have that similarity, compute the 200x200 similarity matrix. Since your data set is really tiny, you can afford expensive clustering methods, such as hierarchical clustering, that would not scale to thousands of objects. If you want, you could also try OPTICS or DBSCAN, but DBSCAN in particular only becomes really interesting when your data set is much larger. For tiny data sets, hierarchical clustering is fine (a rough sketch follows below).
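A rough sketch of that workflow with SciPy (Euclidean distance and a 5-cluster cut are only placeholders; substitute whatever similarity measure you settle on in step 1):

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# obs: the 200x1980 observation array
dists = pdist(obs, metric='euclidean')            # condensed pairwise distances
sim_matrix = squareform(dists)                    # full 200x200 matrix, if you want to inspect it
Z = linkage(dists, method='average')              # agglomerative clustering of the 200 samples
labels = fcluster(Z, t=5, criterion='maxclust')   # cut the dendrogram into 5 clusters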