Hierarchical Clustering Dendrogram using python

Question

Graph theory and Data mining are two fields of computer science I'm still new at, so excuse my basic understanding.

I have been asked to plot a Dendrogram of a hierarchically clustered graph. The input I have been given is the following : a list of all the edges of this graph.

So far I have been able to draw the graph from the input.

The next step would be clustering the graph, then plotting the Dendrogram from that Clustered graph.

My question is : Can someone give me a step by step guide to follow ? what input/output is required/returned during both steps of the process. (Clustering, getting the Dendrogram)

Note :

So far I have been using graph-tool to draw the graphs, I also ran a test code I found on the internet from the Scipy.clustering.hierarchy package, and it seems to have all the needed functions.

Nigel Savage Nigel Savage · Accepted Answer · 2014-12-09T17:34:14

You are correct the 'Scipy.clustering.hierarchy package' is correct here is some python pseudo code to show you the general idea.

from your statement 'So far I have been able to draw the graph from the input.' I am assuming that you have a good start with getting the input data into python ect ..

start python Clustering pseudo code

I needed at least these python packages

import scipy.cluster.hierarchy  as sch
import numpy as np
import sys

you need a 'distance metric', if your input data was 'strings' then you would use something like this

from Levenshtein import jaro

get the Matrix Dimension from the labels for the distance matrix

distanceMatrixDimension= len(p_names)

get coordinates of the upper triangle
upper_triangle=np.triu_indices(distanceMatrixDimension,1)

get the distances distances=np.apply_along_axis(dis,0,upper_triangle)

start the clustering algorithm from the 'Scipy.clustering.hierarchy' package get the linkage matrix Z here 'average' is the method of the linkage Z=sch.linkage(distances,'average')

get a bound for the metrics dimension values generated from the data max_dist=distances.max()

0.2.max_dist acts like a threshold value, experiment with different values fclster=sch.fcluster(Z,0.2*max_dist,'distance')

end python Clustering pseudo code

Z is a linked hierarchical agglomeration clustering of your data another way to say this is it is a (hierarchical)'tree' with a root node that branches down to your leaf nodes that are generally a record or row in the input data you want to cluster

the Deprogram is just a visualization of this tree. There are a number of ways to do this, you can get plot variables from the dimensions of Z. The best way to do this is with matlab or octave. generally you use dendrogram from scipy to plot the 'dendrogram'

import matplotlib.pyplot as plt

then

dendrogram(Z, color_threshold=1, labels=[....],show_leaf_counts=True)

Hierarchical Clustering Dendrogram using python

1 Answers