I used Spectral Clustering to cluster some word feature vectors. I used the cosine similarity matrix of the word vectors as the precomputed affinity matrix for the Spectral Clustering algorithm. This is what I got as three clusters. I used the cluster assignments as the labels to color the points.
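Roughly, the setup looks like this (a minimal sketch assuming scikit-learn's SpectralClustering; X and the variable names are just illustrative):

from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# X holds the word feature vectors; the cosine similarity matrix is used
# directly as the precomputed affinity
cossim = cosine_similarity(X)
spectral = SpectralClustering(n_clusters=3, affinity='precomputed')
spectral_labels = spectral.fit_predict(cossim)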
However, I took the same cosine similarity matrix and calculated a distance matrix from it:
dist = 1 - cosine_similarity
Then I used the Multi-Dimensional Scaling (MDS) algorithm to visualize the points in 3-dimensional space. This is what I got from that. Here I used the actual labels of the points to color them.
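The MDS step, as a minimal sketch assuming scikit-learn's MDS with a precomputed dissimilarity matrix (names are illustrative):

from sklearn.manifold import MDS

# dist is the precomputed dissimilarity matrix (1 - cosine similarity)
mds = MDS(n_components=3, dissimilarity='precomputed', random_state=0)
points_3d = mds.fit_transform(dist)  # one 3D coordinate per point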
If you look at the MDS plot, it doesn't show any distinct cluster grouping, while the Spectral one does seem to show three fairly decent clusters.
1) What do you think the reason might be? Is it because MDS simply plots the points in a lower-dimensional space by virtue of the distances between them (MDS tries to preserve the pairwise distances from the higher-dimensional space in the same proportions), rather than clustering them using some distance criterion? I was hoping the Spectral result would look somewhat similar to the MDS one, since both use the same distance measure; the clustering just fuses nearby points into one group, while MDS plots them as they are. But in the MDS plot I used the actual labels of the points, so doesn't that represent the ground truth?
2) Any suggestions to improve the clustering using some other algorithm, or does this look decent?
EDIT:
It appears there are three groups, but red overlaps with the blue and green groups. But what if it only appears to overlap because of our inability to see beyond 3D (which anyway is rendered in 2D on a computer screen)? What if the red points are actually way above the blue and green ones, and we would see that if we could view them in 3D or even more dimensions?
Take an example below:
Let's say you are looking at three points.
Imagine the points you see above are how they appear if you look at them in 1D, say right from the top.
Now what if you get to see these points in 2D? They appear as below.
Now all three points appear separated. Only because we could see the points in 2D were we able to see the actual relationship between them, which looked like a straight line when viewed from the top in 1D.
So do you think a clustering result like the one above might actually be a set of clearly separated groups, each at a considerable distance from the others, if we were able to visualize them in 3D, even though it currently looks poorly separated to us? Would that mean there are cases like this where visualizing a clustering result is not the right way to judge it, because it might lead to conclusions like the above?
EDIT II:
Ran the DBSCAN algorithm using the following code, with the earlier calculated distance matrix passed as the precomputed matrix:
from sklearn.cluster import DBSCAN

# dist is the precomputed distance matrix (1 - cosine similarity);
# eps and min_samples are left at their defaults (0.5 and 5)
dbscan = DBSCAN(metric='precomputed')
cluster_labels = dbscan.fit_predict(dist)
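Since eps and min_samples are left at their defaults here, which may not suit these cosine distances, one common heuristic (just a sketch, assuming scikit-learn's NearestNeighbors) is to plot the sorted k-nearest-neighbour distances and pick eps near the elbow:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# k-distance plot on the precomputed distance matrix;
# k is set to min_samples (5) as a rule of thumb
nn = NearestNeighbors(n_neighbors=5, metric='precomputed')
nn.fit(dist)
kth_dist, _ = nn.kneighbors(dist)
plt.plot(np.sort(kth_dist[:, -1]))
plt.ylabel('distance to 5th nearest neighbour')
plt.show()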
Please remember that in all the Spectral and DBSCAN plots, the points I am plotting are the output of the Multi-Dimensional Scaling algorithm reduced to three dimensions.
EDIT III:
Calculation of Similarity and Distance Matrix:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# X is the original feature matrix (bag-of-words counts)
cossim = cosine_similarity(X)
dist = 1 - cossim
dist = np.round(dist, 2)
EDIT IV:
Since Anony-Mousse suggested that cosine similarity might not be the right metric, I now used the original bag-of-words design matrix in the DBSCAN algorithm directly and did not provide any precomputed distance matrix, leaving it to sklearn to compute the distances itself. Now this is what I get: just one cluster. That's it. It doesn't separate the points at all. Please remember this is the same bag-of-words matrix from which the distance matrix above was computed; the underlying data points are the same.
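For clarity, that call is roughly the following; note that without a precomputed matrix, sklearn's DBSCAN falls back to its default Euclidean metric, which is a different geometry from the cosine distance used above (a sketch only; names are illustrative):

from sklearn.cluster import DBSCAN

# docs_bow is the raw count bag-of-words matrix;
# defaults: metric='euclidean', eps=0.5, min_samples=5
bow_labels = DBSCAN().fit_predict(docs_bow)

# one alternative sketch: let DBSCAN compute cosine distances itself
cos_labels = DBSCAN(metric='cosine').fit_predict(docs_bow)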
EDIT V:
Converting the count data in the bag of words to tf-idf data. A sketch of the conversion is below.
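A minimal sketch of that conversion, assuming scikit-learn's TfidfTransformer applied to the count matrix (names are illustrative):

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# docs_bow is the raw count bag-of-words matrix
docs_tfidf = TfidfTransformer().fit_transform(docs_bow)

# same similarity/distance construction as before, now on tf-idf features
cossim_tfidf = cosine_similarity(docs_tfidf)
dist_tfidf = 1 - cossim_tfidf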
1) Ran DBSCAN on the precomputed distance matrix of the tf-idf bag-of-words data. In all the earlier approaches I simply used the count bag-of-words matrix as the base; now I am using the tf-idf matrix as the base for the cosine similarity and distance.
It kind of gives two types of points, but again no separation.
DBSCAN on the original tf-idf bag-of-words matrix:
Just one single red blob.
Spectral on cosine similarity matrix of tf-idf data:
Just look at it. It is even worse than the result I got when I used Spectral on the similarity matrix of the count bag-of-words data. Using tf-idf simply screws things up for Spectral.
Spectral on similarity matrix of count data:
If you see, Spectral done on just the count data still gives some local groupings, while done on the tf-idf data it is all mixed up.
Secondly, I have 4000 features. To visualize, I can use at most 3 dimensions, so for that I am using the results out of MDS. MDS is like PCA in this respect: if someone needs to visualize the points, either MDS or PCA needs to be done.
EDIT VI:
So as per Anony-Mousse's comments that MDS is screwing things up, I thought why not try PCA. So I get my clusters from DBSCAN or Spectral and then plot the first three PCs from a PCA. Here is the code I wrote for PCA. docs_bow here can be the tf-idf matrix or even the normal count docs_bow.
import numpy as np

def PCA(docs_bow):
    """Project the dataset onto its first three principal components via SVD.

    Note: docs_bow is not mean-centered here, so strictly this is a
    truncated-SVD (LSA-style) projection rather than textbook PCA.
    """
    # SVD of the data matrix
    U, S, Vt = np.linalg.svd(docs_bow, full_matrices=False)
    # Projection matrix of principal-component loading vectors
    # (transpose of Vt, i.e. the right singular vectors)
    W = Vt.T
    # Keep only the top 3 dimensions
    W = W[:, 0:3]
    # Project the data points onto those three PCs
    PC = np.dot(docs_bow, W)
    return PC
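And a small usage sketch of how the clusters get plotted on those components (assuming matplotlib and the cluster labels from above; names are illustrative):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, enables 3D axes

# docs_bow must be a dense array for np.linalg.svd
PC = PCA(docs_bow)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(PC[:, 0], PC[:, 1], PC[:, 2], c=cluster_labels)
plt.show()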
Spectral on cosine similarity matrix, plotted with MDS:
Spectral on cosine similarity matrix, plotted with the first three PCs of PCA:
DBSCAN on distance matrix, plotted with the first three dimensions of MDS:
DBSCAN on distance matrix, plotted with the first three PCs of PCA:
MDS seems to give a better visualization than PCA. I am not sure what else one can use to reduce dimensions for visualizing high-dimensional data.