3
votes

The question is which should come first: a) the clustering or b) the dimensionality reduction? In other words, can I apply a pseudo dimensionality-reduction method like t-SNE (pseudo because it is really an embedding technique rather than a true dimensionality reduction) and then use a clustering algorithm to extract clusters from the result, or should the clustering be performed in the original high-dimensional space and only be used to color the nodes? Is the following code a good way to start, or am I completely mistaken?

import numpy as np
from sklearn import manifold, cluster

adjMat = g.get_adjacency(attribute='weight')  # adjacency matrix of a really large (igraph) graph
adjMat = np.array(adjMat.data)
adjMat = adjMat.T  # use the incoming-interaction vectors

# dimensionality reduction: embed the rows in 2-D with t-SNE
tsne = manifold.TSNE()
manifoldCoords = tsne.fit_transform(adjMat)

# clustering: extract clusters from the embedded coordinates
clusteralgorithm = cluster.KMeans()
linear_clusters = clusteralgorithm.fit_predict(manifoldCoords)

2 Answers

3
votes

It is always better to perform dimensionality reduction and then clustering.

The reason behind this is that distances in high-dimensional spaces behave in unintuitive ways. One interesting phenomenon is that, as the number of dimensions grows, the ratio between the distance to the nearest point and the distance to the farthest point approaches 1, so distance-based methods lose their ability to discriminate between points.
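This concentration effect is easy to see empirically. A minimal sketch (the point counts and dimensions are just illustrative choices, assuming NumPy and uniformly random data):

```python
import numpy as np

rng = np.random.default_rng(0)

# For 200 points drawn uniformly in [0, 1]^d, compare the smallest and
# largest pairwise Euclidean distances as the dimension d grows.
for d in (2, 10, 100, 1000):
    points = rng.random((200, d))
    # pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (points ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * points @ points.T, 0)
    dists = np.sqrt(d2[np.triu_indices(200, k=1)])  # unique pairs only
    print(f"d={d:4d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```

The printed ratio climbs toward 1 as d increases, which is exactly why clustering directly in the high-dimensional space becomes problematic.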

I suggest you read this question; although it asks specifically about Euclidean distance, it contains a lot of interesting information about the problem in general.

2
votes

It is very common to first reduce the dimensionality and then cluster, simply because clustering high-dimensional data is hard and dimensionality reduction makes it quite a bit more tractable.

As long as you don't forget that clustering is inherently unreliable (so don't trust the results, but study them) you should be fine.
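As a concrete illustration of the reduce-then-cluster pipeline, here is a minimal self-contained sketch using scikit-learn's digits dataset (the dataset and all parameter choices here are illustrative assumptions, not part of the original question):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# 500 handwritten-digit images, each a 64-dimensional vector
X, _ = load_digits(return_X_y=True)
X = X[:500]

# step 1: reduce 64-D -> 2-D with t-SNE
coords = TSNE(n_components=2, random_state=0).fit_transform(X)

# step 2: cluster in the embedded 2-D space
labels = KMeans(n_clusters=10, random_state=0, n_init=10).fit_predict(coords)

print(coords.shape)            # (500, 2)
print(np.unique(labels).size)  # 10
```

The resulting labels can then be inspected (e.g. as colors on the 2-D scatter) rather than taken at face value, in the spirit of the caveat above.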