
I have applied different clustering algorithms (k-means, k-medoids, fast k-means and expectation maximization) to my biomedical dataset using RapidMiner. Now I want to compare the performance of these algorithms to see which one gives better clustering results. For that I have applied operators like 'cluster density performance' and 'cluster distance performance', which give me the average within-cluster distance for each cluster and the Davies-Bouldin index. But I am confused: is this the right way to check the clustering performance of each algorithm? I am also interested in the Silhouette method, which I could apply to each algorithm, but I can't understand where to get the b(i) and a(i) values from the clustering output.


3 Answers


The most reliable way of evaluating clusterings is by looking at your data: are the clusters of any use to you, and do they make sense to a domain expert?

Never just rely on numbers.

For example, you can evaluate clusterings numerically by taking the within-cluster variance.

However, k-means optimizes exactly this value, so k-means will always come out best; in fact, this measure decreases as k increases, but the results do not become any more meaningful!

It is somewhat okay to use a single coefficient such as the Silhouette coefficient to compare results of the same algorithm this way. Since the Silhouette coefficient is somewhat orthogonal to variance minimization, it will make k-means stop at some point, when the result is a reasonable balance of the two objectives.
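As a minimal sketch of that same-algorithm comparison (assuming scikit-learn is available and using a toy blob dataset in place of real biomedical data):

```python
# Sketch: using the Silhouette coefficient to compare different k for one
# algorithm (k-means). Dataset and parameter range are illustrative only.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # higher silhouette = better balance of compactness and separation
    print(k, round(silhouette_score(X, labels), 3))
```

The point is not the absolute numbers but that all runs come from the same algorithm, so the comparison stays fair.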

However, applying such a measure to different algorithms - which may have a different amount of correlation to the measure - is inherently unfair. Most likely, you will be overestimating one algorithm and underestimating the performance of another.

Another popular way is external evaluation, with labeled data. While this should be unbiased against the method -- unless the labels were generated by a similar method -- it has different issues: it will punish a solution that actually discovers new clusters!

All in all, evaluation of unsupervised methods is hard. Really hard. The best you can do is see whether the results prove useful in practice!


It's very good advice never to rely on the numbers.

All the numbers can do is help you focus on particular clusterings that are mathematically interesting. The Davies-Bouldin validity measure is nice because it will show a minimum when it thinks the clusters are most compact with respect to themselves and most separated with respect to others. If you plot a graph of the Davies-Bouldin measure as a function of k, the "best" clustering will show up as a minimum. Of course the data may not form into spherical clusters so this measure may not be appropriate but that's another story.

The Silhouette measure tends to a maximum when it identifies a clustering that is relatively better than another.

The cluster density and cluster distance measures often exhibit an "elbow" as they tend to zero. This elbow often coincides with an interesting clustering (I have to be honest and say I'm never really convinced by this elbow criterion approach).

If you were to plot different validity measures as a function of k and all of the measures gave an indication that a particular k was better than the others, that would be good reason to consider that value in more detail and see if you, as the domain expert for the data, agree.

If you're interested I have some examples here.


There are many ways to evaluate the performance of clustering models in machine learning. They are broadly divided into 3 categories-

1. Supervised techniques

2. Unsupervised techniques

3. Hybrid techniques

Supervised techniques evaluate a clustering against ground-truth labels, so their metric values can be compared with pre-defined reference values.

For example- Jaccard similarity index, Rand Index, Purity etc.

Unsupervised techniques consist of evaluation metrics that cannot be compared with pre-defined values, but they can be compared among different clustering models, and thereby we can choose the best model.

For example - Silhouette measure, SSE

Hybrid techniques are nothing but a combination of supervised and unsupervised methods.

Now, let’s have a look at the intuition behind these methods-

  • Silhouette measure

Silhouette measure is derived from 2 primary measures- Cohesion and Separation.

Cohesion is nothing but the compactness or tightness of the data points within a cluster.

There are basically 2 ways to compute the cohesion-

  1. Graph based cohesion
  2. Prototype based cohesion

Let’s consider that A is a cluster with 4 data points as shown in the figure-

[Figure: cluster A with 4 data points, labelled 1 to 4]

Graph based cohesion computes the cohesion value by adding the distances (Euclidean or Manhattan) from each point to every other point.

Here,

Graph Cohesion(A) = Constant * ( Dis(1,2) + Dis(1,3) + Dis(1,4) + Dis(2,3) + Dis(2,4) + Dis(3,4) )

Where,

Constant = 1/ (2 * Average of all distances)

Prototype based cohesion is calculated by adding the distances of all data points from a commonly accepted reference point, such as the centroid.

[Figure: cluster A with its centroid C]

Here, let’s consider C as centroid in cluster A

Then,

Prototype Cohesion(A) = Constant * (Dis(1,C) +Dis(2,C) + Dis(3,C) + Dis(4,C))

Where,

Constant = 1/ (2 * Average of all distances)
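The two cohesion variants above can be sketched in a few lines of NumPy. The cluster points are made up for illustration, and the scaling constant from the text is omitted:

```python
# Sketch: graph-based vs prototype-based cohesion for one cluster of points.
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 4 toy points

def graph_cohesion(pts):
    # sum of Euclidean distances over every pair of points in the cluster
    n = len(pts)
    return sum(np.linalg.norm(pts[i] - pts[j])
               for i in range(n) for j in range(i + 1, n))

def prototype_cohesion(pts):
    # sum of distances from each point to the cluster centroid
    c = pts.mean(axis=0)
    return sum(np.linalg.norm(p - c) for p in pts)

print(graph_cohesion(A), prototype_cohesion(A))
```

For this unit square the pairwise sum is 4 sides plus 2 diagonals, while the centroid-based sum is 4 half-diagonals, so the two variants measure compactness on different scales.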

Separation is the distance or magnitude of difference between the data points of 2 different clusters.

Here also, we have primarily 2 ways of computing the separation value.

1. Graph based separation

2. Prototype based separation

Graph based separation calculates the value by adding the distances between every point in Cluster 1 and every point in Cluster 2.

For example, if A and B are 2 clusters with 4 data points each, then

[Figure: clusters A and B with 4 data points each]

Graph based separation = Constant * ( Dis(A1,B1) + Dis(A1,B2) + Dis(A1,B3) + Dis(A1,B4) + Dis(A2,B1) + Dis(A2,B2) + Dis(A2,B3) + Dis(A2,B4) + Dis(A3,B1) + Dis(A3,B2) + Dis(A3,B3) + Dis(A3,B4) + Dis(A4,B1) + Dis(A4,B2) + Dis(A4,B3) + Dis(A4,B4) )

Where,

Constant = 1/ number of clusters

Prototype based separation is calculated by finding the distance between the commonly accepted points of the 2 clusters, such as their centroids.

[Figure: the centroids of clusters A and B]

Here, we can simply calculate the distance between the centroid of 2 cluster A and B i.e. Dis(C(A),C(B)) multiplied by a constant where constant = 1/ number of clusters.
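A matching NumPy sketch for the two separation variants, again with made-up points and the constants omitted:

```python
# Sketch: graph-based vs prototype-based separation between two clusters.
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = A + np.array([10.0, 0.0])   # cluster B: cluster A shifted 10 units right

def graph_separation(p, q):
    # sum of distances between every point of one cluster and every point of the other
    return sum(np.linalg.norm(a - b) for a in p for b in q)

def prototype_separation(p, q):
    # distance between the two cluster centroids
    return np.linalg.norm(p.mean(axis=0) - q.mean(axis=0))

print(graph_separation(A, B), prototype_separation(A, B))
```

Since B is just A shifted by 10 units, the centroid distance is exactly 10, while the graph-based sum runs over all 16 point pairs.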

Silhouette measure = (b - a) / max(b, a)

Where, for a given data point i,

a = a(i), the average distance from i to all other points in its own cluster (cohesion)

b = b(i), the smallest average distance from i to the points of any other cluster (separation)

The Silhouette measure of a whole clustering is the mean of these per-point values.

If Silhouette measure is close to -1, the clustering is very poor (points tend to sit in the wrong clusters).

If Silhouette measure is close to 0, clusters overlap and points lie on or near the boundary between two clusters.

If Silhouette measure is close to 1, the clustering is very good.

When we have multiple clustering algorithms, it is always recommended to choose the one with the highest Silhouette measure.
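In practice you rarely need to extract a(i) and b(i) yourself: given the feature matrix and the labels an algorithm produced, scikit-learn (if it is available to you) computes the per-point silhouette values directly. A sketch on toy data:

```python
# Sketch: per-point silhouette values s(i) = (b(i) - a(i)) / max(a(i), b(i)),
# computed from the data and the cluster labels of any algorithm.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

s = silhouette_samples(X, labels)             # one value per data point
print(s.mean(), silhouette_score(X, labels))  # the mean of s is the overall score
```

The `labels` array can come from any clustering algorithm, so the same call works for k-means, k-medoids or EM output alike.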

  • SSE ( Sum of squared errors)

SSE is calculated by adding, over every cluster, the squared distances of the cluster's data points from its centroid.

SSE = Dis(1,C)^2 + Dis(2,C)^2 + ... , summed over all clusters.

When we have multiple clustering algorithms, it is always recommended to choose the one with low SSE.
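A sketch of SSE for a k-means result, assuming scikit-learn (which exposes the same quantity as `inertia_`) and toy data:

```python
# Sketch: SSE = sum of squared distances of each point to its cluster centroid.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=2)
km = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X)

sse = sum(np.sum((X[km.labels_ == c] - km.cluster_centers_[c]) ** 2)
          for c in range(3))
print(sse, km.inertia_)   # scikit-learn stores the same value as inertia_
```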

  • Jaccard similarity index

Jaccard similarity index is measured using the ground-truth labels of the data points. If labels are not provided, we cannot compute this index.

The pairs of data points are divided into 4 categories-

True Negative (TN) = Pairs of points with a different class and a different cluster

True Positive (TP) = Pairs of points with the same class and the same cluster

False Negative (FN) = Pairs of points with the same class but a different cluster

False Positive (FP) = Pairs of points with a different class but the same cluster

Here,

[Figure: example clustering with class-labelled points, used for the counts below]

Note - nC2 means number of combinations with 2 elements possible from a set containing n elements

nC2 = n*(n-1)/2

TP = 5C2 + 4C2 + 2C2 + 3C2 = 20

FN = 5C1 * 1C1 + 5C1 * 2C1 + 1C1 * 4C1 + 1C1 * 2C1 + 1C1 * 3C1 = 24

FP = 5C1 * 1C1 + 4C1 * 1C1 +4C1 * 1C1 + 1C1 * 1C1 +3C1 * 2C1 = 20

TN= 5C1 * 4C1 + 5C1 * 1C1 + 5C1 * 3C1 + 1C1 * 1C1 + 1C1 * 1C1 + 1C1 * 2C1 + 1C1 * 3C1 + 4C1 * 3C1 + 4C1 * 2C1 + 1C1 * 3C1 + 1C1 * 2C1 = 72

Jaccard similarity index = TP / (TP + FP + FN) (true-negative pairs are ignored)

Here, Jaccard similarity index = 20 / (20 + 20 + 24) ≈ 0.31

  • Rand Index

Rand Index is similar to Jaccard similarity index. Its formula is given by-

Rand Index = (TP + TN) / (TP + TN + FP +FN)

Here, Rand Index = (20 + 72) / (20 + 72 + 20 + 24) ≈ 0.68

When Rand index is above 0.7, it can be considered as a good clustering.

Similarly, when Jaccard similarity index is above 0.5, it can be considered as good clustering.
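The pair-counting scheme above is easy to reproduce from scratch; the sketch below uses a small made-up labelling rather than the figure's counts:

```python
# Sketch: pair-counting TP/FP/FN/TN, Jaccard and Rand index from scratch.
from itertools import combinations

def pair_counts(classes, clusters):
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and same_cluster:
            fp += 1
        elif same_class and not same_cluster:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

classes  = [0, 0, 0, 1, 1, 1]   # toy ground-truth labels
clusters = [0, 0, 1, 1, 1, 1]   # toy cluster assignments
tp, fp, fn, tn = pair_counts(classes, clusters)
jaccard = tp / (tp + fp + fn)
rand    = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, round(jaccard, 3), round(rand, 3))
```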

  • Purity

This metrics also requires labels in the data. The formula is given by-

Purity = ( count of the most frequent label in Cluster 1 + count of the most frequent label in Cluster 2 + ... + count of the most frequent label in Cluster n ) / Total number of data points.

For example, let's consider 3 clusters – A, B and C with labelled data points

[Figure: three clusters A, B and C with coloured, labelled points]

Purity = (a + b + c) / n

Where,

a = Number of black circles in cluster A (Since black is the maximum in count)

b = Number of red circles in cluster B (Since red is the maximum in count)

c = Number of green circles in cluster C (Since green is the maximum in count)

n = Total number of data points

Here, Purity = (5 + 6 + 3) / (8 + 9 + 5) ≈ 0.64

If purity is greater than 0.7 then it can be considered as a good clustering.
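A sketch of the purity computation, using the same cluster sizes and majority counts as the example above:

```python
# Sketch: purity from (cluster, true-label) pairs; counts mirror the example
# (cluster A: 5 black + 3 red, B: 6 red + 3 green, C: 3 green + 2 black).
from collections import Counter

def purity(clusters, labels):
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    # per cluster, count its majority label; divide the total by n
    return sum(max(Counter(ys).values()) for ys in by_cluster.values()) / len(labels)

clusters = ['A'] * 8 + ['B'] * 9 + ['C'] * 5
labels   = (['black'] * 5 + ['red'] * 3      # cluster A
            + ['red'] * 6 + ['green'] * 3    # cluster B
            + ['green'] * 3 + ['black'] * 2) # cluster C
print(round(purity(clusters, labels), 2))    # (5 + 6 + 3) / 22
```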

Original source - https://qr.ae/pNsxIX