
I used spectral clustering to cluster some word feature vectors, passing the cosine similarity matrix of the word vectors as the precomputed affinity matrix to the spectral clustering algorithm. This is what I got as three clusters; I used the cluster assignments as the labels to color the points.

[image: spectral clustering result, colored by cluster assignment]

Then I took the same cosine similarity matrix and computed a distance matrix from it:

dist = 1 - cosine_similarity

Then I used the Multi-Dimensional Scaling (MDS) algorithm to visualize the points in three-dimensional space. This is what I got from that; here I used the points' actual labels to color them.

[image: MDS projection, colored by the actual labels]
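For reference, a minimal sketch of the pipeline described above, assuming `word_vectors` is the feature matrix (the names are placeholders):

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering
from sklearn.manifold import MDS

# cosine similarity as the affinity for spectral clustering
cossim = cosine_similarity(word_vectors)
labels = SpectralClustering(n_clusters=3, affinity='precomputed').fit_predict(cossim)

# the same matrix turned into a dissimilarity for MDS
dist = 1 - cossim
mds_coords = MDS(n_components=3, dissimilarity='precomputed').fit_transform(dist)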

If you look at the MDS plot, it doesn't show any distinct cluster grouping, while the spectral one does seem to show three pretty decent clusters.

1) What do you think the reason might be? Is it because MDS simply plots the points in a lower-dimensional space by virtue of the distances between them (MDS shrinks the distances between points in the same proportion as their distances in the higher-dimensional space, thereby preserving the relative pairwise distances) and does not cluster them by some distance criterion? I was hoping the spectral result would look somewhat like the MDS one, since both use the same distance measure; clustering just fuses nearby points into one group, while MDS plots them as they are. But in the MDS plot I used the actual labels of the points, so doesn't that indicate the ground truth? (A quick sanity check of the distance-preservation claim is sketched below.)
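One quick way to check how well the embedding preserves the distances, assuming `dist` is the square distance matrix from above and `mds_coords` is the 3D MDS output (placeholder names):

import numpy as np
from scipy.spatial.distance import pdist, squareform

orig = squareform(dist)   # condensed form of the precomputed distances
emb = pdist(mds_coords)   # euclidean distances in the 3D embedding
print(np.corrcoef(orig, emb)[0, 1])  # close to 1 means distances are well preserved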

2) Any suggestions to improve the clustering with some other algorithm, or does this look decent?

EDIT:

It appears there are three groups, but red overlaps with the blue and green groups. But what if it only appears to overlap because of our inability to see beyond 3D (which is in any case rendered in 2D on a computer screen)? What if the red points would sit way above the blue and green ones if we could see them in 3D or in even more dimensions?

Take an example below:

Let's say you are looking at three points.

[image: three points viewed in 1D, appearing to lie on a line]

Imagine the points above are what you see when you look at them in 1D, say straight down from the top.

Now what if you get to see these points in 2D? They appear as below.

[image: the same three points, clearly separated in 2D]

Now all three points appear separated. Only once we could see the points in 2D were we able to see the actual relationship between them, which had looked like a straight line when viewed from the top in 1D.

So do you think a clustering result like the one above might actually be a set of clearly separated groups, each quite far from the others in reality, if we were able to visualize them in 3D, even though it currently appears poorly separated to us? Would that mean there can be cases like this where visualizing a clustering result is not the right way to judge it, since it can lead to conclusions like the one above? (One way around this is to score the clustering in the original space, as sketched below.)
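Rather than judging separation visually, one can score it in the full-dimensional space, for example with the silhouette coefficient on the precomputed distance matrix. A minimal sketch, assuming `dist` and the spectral `labels` from above (placeholder names):

from sklearn.metrics import silhouette_score

# values near 1 mean well-separated clusters, values near 0 mean overlap
print(silhouette_score(dist, labels, metric='precomputed'))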

EDIT II:

Ran the DBSCAN algorithm using the following code, with the earlier calculated distance matrix as the precomputed matrix for DBSCAN:

from sklearn.cluster import DBSCAN

# eps and min_samples are left at their defaults (0.5 and 5)
dbscan = DBSCAN(metric='precomputed')
cluster_labels = dbscan.fit_predict(dist)

Please remember that in all the spectral and DBSCAN plots, the points I am plotting are the output of the Multi-Dimensional Scaling algorithm in three dimensions.
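Note that DBSCAN's default eps=0.5 may not suit cosine distances, which here lie between 0 and 2. A common way to pick eps is a k-distance plot; a sketch, reusing the `dist` matrix from above:

import numpy as np
import matplotlib.pyplot as plt

k = 4  # often min_samples - 1
# distance of every point to its k-th nearest neighbor, sorted ascending;
# the 'elbow' of this curve is a candidate value for eps
kth_dist = np.sort(np.sort(dist, axis=1)[:, k])
plt.plot(kth_dist)
plt.ylabel('distance to %d-th nearest neighbor' % k)
plt.show()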

DBSCAN graph: [image: DBSCAN clusters on the MDS projection]

EDIT III:

Calculation of Similarity and Distance Matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity of the design matrix X, then turn it into a distance
cossim = cosine_similarity(X)
dist = 1 - cossim
dist = np.round(dist, 2)

EDIT IV:

Since Anony-Mousse suggested that cosine similarity might not be the right metric, I now fed the original bag-of-words design matrix to the DBSCAN algorithm directly, without providing any precomputed distance matrix, and left it to sklearn to compute the distances itself. This is what I get now: just one cluster. That's it. It doesn't separate the points at all. Please remember this is the same bag-of-words matrix from which the distance matrix above was computed; the underlying data points are the same.

[image: DBSCAN on the raw bag-of-words matrix, showing a single cluster]
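For clarity, a sketch of this variant, assuming `docs_bow` is the count bag-of-words matrix; note that without a precomputed matrix sklearn's DBSCAN defaults to euclidean distances, not cosine, and eps stays at 0.5:

from sklearn.cluster import DBSCAN

# default metric='euclidean' and eps=0.5 on the raw count vectors
labels_bow = DBSCAN().fit_predict(docs_bow)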

EDIT V:

Converting the count data in the bag of words to tf-idf data.

1) Ran DBSCAN on the precomputed distance matrix of the tf-idf bag-of-words data. In all the earlier approaches I simply used the count bag-of-words matrix as the base; now I am using the tf-idf matrix as the base for the cosine similarity and the distance (sketched below).
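A minimal sketch of this step, assuming `docs_bow` is the count matrix from before:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# reweight counts to tf-idf, then build the cosine distance matrix as before
tfidf = TfidfTransformer().fit_transform(docs_bow)
cossim_tfidf = cosine_similarity(tfidf)
dist_tfidf = 1 - cossim_tfidf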

DBSCAN on the distance matrix: [image]

It kind of gives two types of points, but again no separation.

DBSCAN on the original tf-idf bag-of-words matrix:

[image]

Just one single red blob.

Spectral on the cosine similarity matrix of the tf-idf data:

[image]

Just look at it. It is even worse than the result I got when I ran spectral on the similarity matrix of the count bag-of-words data. Using tf-idf simply screws things up for spectral.

Spectral on the similarity matrix of the count data:

[image]

Spectral done on just the count data still gives some local groupings, while on the tf-idf data everything is mixed up.

Secondly, I have 4000 features, and to visualize I can use at most three dimensions, which is why I am using the MDS output. MDS is like PCA in this respect: to visualize the points, either MDS or PCA has to be done first.

EDIT VI:

As per Anony-Mousse's comment that MDS is screwing things up, I thought, why not try PCA? So I get my clusters from DBSCAN or spectral, and then I plot the first three PCs from a PCA. Here is the code I wrote for it; `docs_bow` can be the tf-idf matrix or the normal count bag-of-words matrix.

def PCA(docs_bow):
    """Carry out PCA of the dataset via SVD and return the projection
    onto the first three principal components. Note that the matrix is
    not mean-centered here; textbook PCA would subtract the column
    means first.
    """
    # SVD of the document matrix
    U, S, Vt = np.linalg.svd(docs_bow, full_matrices=False)
    # Projection matrix of principal-component loading vectors:
    # transpose of Vt, i.e. the right singular vectors
    W = Vt.T
    # Keep only the top three dimensions
    W = W[:, 0:3]
    # Project the data points onto those three PCs
    PC = np.dot(docs_bow, W)
    return PC
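A usage sketch, assuming `cluster_labels` holds the DBSCAN or spectral assignments from above:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # enables the 3D projection

PC = PCA(docs_bow)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(PC[:, 0], PC[:, 1], PC[:, 2], c=cluster_labels)
plt.show()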

Spectral on the cosine matrix, plotted on the MDS result: [image]

Spectral on the cosine matrix, plotted on the first three PCs of PCA: [image]

DBSCAN on the distance matrix, plotted on the first three MDS dimensions: [image]

DBSCAN on the distance matrix, plotted on the first three PCs of PCA: [image]

MDS seems to give a better visualization than PCA. I am not sure what else one can use to reduce dimensions for visualizing high-dimensional data.

1 Answer

If you disable the colors, how many clusters do you see?

I'd say there is only one big cluster in this data set, at least with this preprocessing/visualization.

The three clusters you get from spectral clustering are meaningless. It is essentially doing quantization; it did not discover structure. It minimized squared deviations by chunking the data into three similarly sized chunks. But if you ran it again, it would probably produce similar-looking but different chunks: the result is largely random, sorry.

Do not expect classes to agree with clusters. As you can see in this data set, it may have three labels, but only one big messy "cluster".
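One way to quantify this is the adjusted Rand index between the class labels and the cluster assignments; a value near 0 means the clustering agrees with the classes no better than chance. A sketch, with `true_labels` and `cluster_labels` as placeholder names:

from sklearn.metrics import adjusted_rand_score

print(adjusted_rand_score(true_labels, cluster_labels))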

It's easy to produce this effect:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, cluster

# a single 3D gaussian blob with 5000 points, then 3 requested clusters
data = datasets.make_blobs(5000, 3, centers=1)[0]
c = cluster.SpectralClustering(n_clusters=3).fit_predict(data)
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
plt.scatter(data[:, 0], data[:, 1], color=colors[c].tolist())
plt.show()

produces this:

[image: meaningless spectral clustering of a single Gaussian]

Notice how similar this is to your result? Yet this is a single Gaussian; there are no clusters here. Spectral clustering produced three clusters as requested, but they are totally meaningless.

Always check whether your results are meaningful. It is really easy to get something that looks like a nice partitioning but is just a random convex partitioning. Just because an algorithm produced clusters doesn't mean the clusters are there! Algorithms such as k-means and spectral clustering are prone to pareidolia: https://en.wikipedia.org/wiki/Pareidolia

Can you see the cluster in this face? ;-) Is it real?

[image: a face-shaped data set]

Here is a rather successful MDS projection of a text data set. It shows a number of clusters stretching out into different directions of the data space. K-means and variants will not work well on this data set. Gaussian Mixture Modeling would work, but only on the projected data (see the sketch below).

[image: MDS projection of a text data set with elongated clusters]
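A minimal sketch of that suggestion, assuming `proj` holds the projected coordinates and three components are wanted (both placeholders):

from sklearn.mixture import GaussianMixture

# fit the mixture on the low-dimensional projection, not the raw data
gmm_labels = GaussianMixture(n_components=3).fit_predict(proj)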