7
votes

I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.

I know I can get a confusion matrix easily for a test set of a classification problem. I already tried that like this.

However, it can't be used for clustering because it expects the rows and columns to have the same set of labels, which makes sense for a classification problem. For a clustering problem, what I expect is something like this:

Rows - Actual labels

Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)

Is there a way to do this?

Edit: Here are more details.

In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.

That's why it gives a matrix which has the same labels for both rows and columns like this.

[Image: confusion matrix with the same labels on rows and columns]

But in my case (KMeans clustering), the real values are strings and the estimated values are numbers (i.e. cluster numbers).

Therefore, if I call confusion_matrix(y_true, y_pred), it gives the error below.

ValueError: Mix of label input types (string and number)

This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.

With this, I understand that I'm trying to use a tool meant for classification problems on a clustering problem. So my question is: is there a way I can get such a matrix for my clustered data?

Hope the question is now clearer. Please let me know if it isn't.

Please clarify this with an example sample. – Vivek Kumar
Added more details. Thanks. – Bee
So unless you know how to map a cluster number to your real results, how will you proceed? – Vivek Kumar
That mapping part is exactly what I'm trying to learn. I just want to know whether the real labels and the natural cluster numbers can be mapped or not. I can do it myself if I can get the real labels in columns and the cluster names in rows (or vice versa). Taking the Iris dataset as an example, what I basically want to know is how many setosas, how many virginicas, etc. are in each of my new clusters. Do you understand what I'm looking for? – Bee
Check the chapter on clustering performance evaluation in the scikit-learn documentation (e.g., Adjusted Rand index, Normalized/Adjusted Mutual Information, V-measure). – σηγ

2 Answers

2
votes

I wrote the code myself.

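# Compute a label-by-cluster confusion (contingency) matrix
def confusion_matrix(act_labels, pred_labels):
    # Sort so that the row and column order is deterministic
    unique_labels = sorted(set(act_labels))
    clusters = sorted(set(pred_labels))
    # Map each label / cluster id to its row / column index, so cluster
    # ids do not have to be the integers 0..k-1
    row_of = {label: i for i, label in enumerate(unique_labels)}
    col_of = {cluster: j for j, cluster in enumerate(clusters)}
    cm = [[0] * len(clusters) for _ in unique_labels]
    for act_label, pred_label in zip(act_labels, pred_labels):
        cm[row_of[act_label]][col_of[pred_label]] += 1
    return cm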

# Example
labels=['a','b','c',
        'a','b','c',
        'a','b','c',
        'a','b','c']
pred=[  1,1,2,
        0,1,2,
        1,1,1,
        0,1,2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
      for row in cnf_matrix]))

Edit: (Dayyyuumm) I just found that I could do this easily with pandas crosstab :-/.

import pandas as pd

labels=['a','b','c',
        'a','b','c',
        'a','b','c',
        'a','b','c']
pred=[  1,1,2,
        0,1,2,
        1,1,1,
        0,1,2]   

# Create a DataFrame with labels and clusters as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})

# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])

# Display ct
print(ct)
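
Once you have these counts, the mapping asked about in the comments (which real label each cluster mostly corresponds to) can be read off by taking the most frequent label in each cluster. Below is a minimal sketch using only the standard library. The function names `majority_mapping` and `purity` are my own, not part of pandas or sklearn, and the mapping is many-to-one: two clusters may end up pointing to the same label, and ties are broken arbitrarily.

```python
from collections import Counter, defaultdict

def majority_mapping(true_labels, cluster_ids):
    """Map each cluster id to the true label that occurs most often in it."""
    counts = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        counts[cluster][label] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}

def purity(true_labels, cluster_ids):
    """Fraction of points whose cluster's majority label equals their own label."""
    mapping = majority_mapping(true_labels, cluster_ids)
    hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
    return hits / len(true_labels)

# Same example data as above
labels = ['a', 'b', 'c'] * 4
pred = [1, 1, 2,
        0, 1, 2,
        1, 1, 1,
        0, 1, 2]

mapping = majority_mapping(labels, pred)
print(sorted(mapping.items()))   # [(0, 'a'), (1, 'b'), (2, 'c')]
print(purity(labels, pred))      # 0.75
```

For this data, 9 of the 12 points fall in a cluster whose majority label matches their own, hence 0.75.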

1
vote

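You can easily compute a pairwise intersection matrix: one row per real label, one column per cluster, and each cell holds the number of points the two have in common.

But it may be necessary to do this yourself if the sklearn function has been written for the classification use case only.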
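
A minimal sketch of doing it yourself, assuming the labels and cluster ids come as two equal-length lists. `intersection_matrix` is a hypothetical helper name, not an sklearn function. Each cell is literally the size of the intersection between the set of points carrying a label and the set of points assigned to a cluster.

```python
def intersection_matrix(true_labels, cluster_ids):
    """Return (row_labels, column_clusters, matrix) of pairwise intersection sizes."""
    rows = sorted(set(true_labels))
    cols = sorted(set(cluster_ids))
    # Index sets: which points carry each label / belong to each cluster
    points_by_label = {r: {i for i, t in enumerate(true_labels) if t == r}
                       for r in rows}
    points_by_cluster = {c: {i for i, k in enumerate(cluster_ids) if k == c}
                         for c in cols}
    matrix = [[len(points_by_label[r] & points_by_cluster[c]) for c in cols]
              for r in rows]
    return rows, cols, matrix

# String labels on the rows, numeric cluster ids on the columns
true_labels = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
cluster_ids = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]

rows, cols, matrix = intersection_matrix(true_labels, cluster_ids)
print('     ' + ''.join('{:>4}'.format(c) for c in cols))
for label, row in zip(rows, matrix):
    print('{:<5}'.format(label) + ''.join('{:4}'.format(n) for n in row))
```

Because the row and column keys are handled separately, it does not matter that the labels are strings and the cluster ids are numbers.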