2
votes

This is a multi-class classification problem with 27 classes.

y_predict=[0 0 0 20 26 21 21 26 ....]

y_true=[1 10 10 20 26 21 18 26 ...]  

A list named "answer_vocabulary" stores the 27 words corresponding to each index: answer_vocabulary=[0 1 10 11 2 3 agriculture commercial east living north .....]

cm = confusion_matrix(y_true=y_true, y_pred=y_predict)

I'm confused about the order of the confusion matrix. Is it in ascending index order? And if I want to reorder the confusion matrix to follow the label sequence [0 1 2 3 10 11 agriculture commercial living east north ...], how can I do that?

Here is a function I have tried for plotting the confusion matrix.

import itertools

import matplotlib.pyplot as plt
import numpy as np


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize before drawing so the image and the cell text agree.
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell's value, using white text on dark cells.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

2 Answers

1
votes

The confusion matrices from sklearn don't store information about how the matrix was created (class ordering and normalization). The result is a plain array, so you must keep track of the label order yourself or the information will be lost.

By default, sklearn.metrics.confusion_matrix(y_true, y_pred) orders the rows and columns by the sorted unique labels that appear in y_true or y_pred, not by the order in which the classes first appear.

If you pass this data to sklearn.metrics.confusion_matrix:

+--------+--------+
| y_true | y_pred |
+--------+--------+
| A      | B      |
| C      | C      |
| D      | B      |
| B      | A      |
+--------+--------+

Scikit-learn will create this confusion matrix (zeros omitted):

+-----------+---+---+---+---+
| true\pred | A | B | C | D |
+-----------+---+---+---+---+
| A         |   | 1 |   |   |
| B         | 1 |   |   |   |
| C         |   |   | 1 |   |
| D         |   | 1 |   |   |
+-----------+---+---+---+---+

And it will return this numpy array to you:

+---+---+---+---+
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
+---+---+---+---+
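
A quick way to check the ordering is to run the example. This is a minimal sketch, assuming scikit-learn is installed; the variable names are only illustrative.

```python
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

y_true = ["A", "C", "D", "B"]
y_pred = ["B", "C", "B", "A"]

# Labels used for the rows/columns when `labels` is not passed:
# the sorted union of the labels seen in y_true and y_pred.
order = unique_labels(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(order)  # row/column labels, in matrix order
print(cm)     # rows are true labels, columns are predicted labels
```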

If you want to select classes, or reorder them you can pass the 'labels' argument to confusion_matrix().

For reordering:

labels = ['D','C','B','A']
mat = confusion_matrix(true_y,pred_y, labels=labels)

Or, if you just want to focus on some labels (useful if you have a lot of labels):

labels = ['A','D']
mat = confusion_matrix(true_y,pred_y, labels=labels)
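
Applied to the setup in the question, this could look roughly like the sketch below. It assumes that y_true and y_predict hold integer indices into answer_vocabulary and that the vocabulary entries are strings; the short vocabulary and the data are made-up stand-ins for the real 27 classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical stand-in for the question's 27-entry vocabulary.
answer_vocabulary = ["0", "1", "10", "11", "2", "3", "agriculture"]
y_true = np.array([1, 2, 2, 4, 6, 5, 0])
y_predict = np.array([0, 0, 0, 4, 6, 5, 0])

# Desired display order, written as words.
wanted = ["0", "1", "2", "3", "10", "11", "agriculture"]
# Translate the words back into the integer indices used in y_true/y_predict.
wanted_idx = [answer_vocabulary.index(w) for w in wanted]

# Row/column k of cm now corresponds to wanted[k].
cm = confusion_matrix(y_true, y_predict, labels=wanted_idx)
print(cm)
```

Here cm and wanted can then be handed to a plotting function as the matrix and its tick labels.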

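Also, take a look at sklearn.metrics.plot_confusion_matrix (deprecated in scikit-learn 1.0 and removed in 1.2 in favor of sklearn.metrics.ConfusionMatrixDisplay). It works very well for a small number (<100) of classes.

If you have >100 classes it will take a while to plot the matrix.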

0
votes

The order of the columns/rows in the resulting confusion matrix is the same as returned by sklearn.utils.unique_labels(), which extracts "an ordered array of unique labels". In the source code of confusion_matrix() (main, git-hash 7e197fd), the lines of interest read as follows:

if labels is None:
    labels = unique_labels(y_true, y_pred)
else:
    labels = np.asarray(labels)

Here, labels is the optional argument of confusion_matrix() to prescribe an ordering/subset of labels yourself:

cm = confusion_matrix(true_y, pred_y, labels=labels)

Therefore, if labels = [0, 10, 3], cm will have shape (3, 3), with the rows/columns in exactly that order. To index the rows/columns directly by label, you can wrap the result in a pandas DataFrame:

import pandas as pd
cm = pd.DataFrame(cm, index=labels, columns=labels)
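
A self-contained version of the above, as a sketch with made-up data (the label 7 is deliberately left out of labels to show subsetting):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

true_y = [0, 10, 3, 3, 7]
pred_y = [0, 3, 3, 10, 7]
labels = [0, 10, 3]  # prescribes both the order and the subset

# Samples whose true or predicted label is not in `labels` are ignored.
cm = confusion_matrix(true_y, pred_y, labels=labels)
cm = pd.DataFrame(cm, index=labels, columns=labels)

# Rows are true labels, columns are predicted labels.
n_10_as_3 = cm.loc[10, 3]  # samples with true label 10 predicted as 3
print(cm)
```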

Note that the docs of unique_labels() state that mixed types of labels (numeric and string) are not supported. In this case, I'd recommend using a LabelEncoder. This will save you from maintaining your own lookup table.

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# y now has values between 0 and n_labels-1.
# Do some ops here...
...

# To convert back:
y_pred = encoder.inverse_transform(y_pred)
y = encoder.inverse_transform(y)
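
Combined with the labels argument, the encoder can also produce a custom ordering. The following is a sketch under the assumption that all labels are strings; the data and the wanted order are made up for illustration.

```python
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

y = ["1", "10", "10", "2", "agriculture", "3", "0"]
y_pred = ["0", "0", "0", "2", "agriculture", "3", "0"]

# Fit on both lists so that every label seen anywhere gets a code.
encoder = LabelEncoder()
encoder.fit(y + y_pred)
y_enc = encoder.transform(y)
y_pred_enc = encoder.transform(y_pred)

# encoder.classes_ is sorted; for strings that means "10" comes before "2".
print(encoder.classes_)

# Desired display order, translated into the encoder's integer codes.
wanted = ["0", "1", "2", "3", "10", "agriculture"]
wanted_enc = encoder.transform(wanted)

# Row/column k of cm corresponds to wanted[k].
cm = confusion_matrix(y_enc, y_pred_enc, labels=wanted_enc)
print(cm)
```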

As the previous answer already mentioned, plot_confusion_matrix() (ConfusionMatrixDisplay in scikit-learn 1.0 and later) comes in handy to visualize the confusion matrix.