33
votes

Can someone please explain (ideally with an example) the difference between OneVsRestClassifier and MultiOutputClassifier in scikit-learn?

I've read the documentation and understood that we use:

  • OneVsRestClassifier - when we want to do multiclass or multilabel classification. Its strategy consists of fitting one classifier per class: for each classifier, the class is fitted against all the other classes. (This is pretty clear, and it means that a multiclass/multilabel classification problem is broken down into multiple binary classification problems.)
  • MultiOutputClassifier - when we want to do multi-target classification (what is this?). Its strategy consists of fitting one classifier per target (what does target mean here?)

I've already used OneVsRestClassifier for multilabel classification and I understand how it works, but then I found MultiOutputClassifier and can't understand how it works differently from OneVsRestClassifier.

2 Answers

33
votes

Multiclass classification

To better illustrate the differences, let us assume that your goal is to classify SO questions into n_classes different, mutually exclusive classes. For the sake of simplicity, in this example we will only consider four classes, namely 'Python', 'Java', 'C++' and 'Other language'. Let us assume that you have a dataset of just six SO questions, and that the class labels of those questions are stored in an array y as follows:

import numpy as np
y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])

The situation described above is usually referred to as multiclass classification (also known as multinomial classification). To fit a classifier and validate the model with the scikit-learn library, you need to transform the text class labels into numerical labels. To accomplish that you could use LabelEncoder:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_numeric = le.fit_transform(y)

This is how the labels of your dataset are encoded:

In [220]: y_numeric
Out[220]: array([1, 0, 2, 3, 0, 3], dtype=int64)

where those numbers denote indices of the following array:

In [221]: le.classes_
Out[221]: 
array(['C++', 'Java', 'Other language', 'Python'], 
      dtype='|S14')
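
The mapping can also be reversed, which is handy when you want to turn numeric predictions back into class names. A minimal sketch, reusing the same six labels as above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])

le = LabelEncoder()
# Classes are sorted and then numbered 0 .. n_classes - 1
y_numeric = le.fit_transform(y)

# inverse_transform maps the integer codes back to the original strings
y_decoded = le.inverse_transform(y_numeric)
print(y_decoded)
```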

An important particular case is when there are just two classes, i.e. n_classes = 2. This is usually called binary classification.

Multilabel classification

Let us now suppose that you wish to perform such multiclass classification using a pool of n_classes binary classifiers, where n_classes is the number of different classes. Each of these binary classifiers decides whether an item belongs to one specific class or not. In this case you cannot encode class labels as integers from 0 to n_classes - 1; you need to create a 2-dimensional indicator matrix instead. Suppose that sample n is of class k. Then the [n, k] entry of the indicator matrix is 1 and the rest of the elements in row n are 0. It is important to note that if the classes are not mutually exclusive, there can be multiple 1's in a row. This approach is named multilabel classification and can be easily implemented through MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])

The indicator looks like this:

In [225]: y_indicator
Out[225]: 
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])

and the column numbers where the 1's appear are indices into this array:

In [226]: mlb.classes_
Out[226]: array(['C++', 'Java', 'Other language', 'Python'], dtype=object)
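
To connect this with the question: an indicator matrix like the one above is the kind of target OneVsRestClassifier accepts for multilabel problems. The following is only a sketch under stated assumptions: the feature matrix X is random, made-up data standing in for real question features, and LogisticRegression is an arbitrary choice of base estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Indicator matrix from above: one row per question, one column per class
y_indicator = np.array([[0, 1, 0, 0],
                        [1, 0, 0, 0],
                        [0, 0, 1, 0],
                        [0, 0, 0, 1],
                        [1, 0, 0, 0],
                        [0, 0, 0, 1]])

# Hypothetical features: 6 questions x 5 made-up numeric features
rng = np.random.RandomState(0)
X = rng.rand(6, 5)

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y_indicator)

# One binary classifier is fitted per column of the indicator matrix
print(len(ovr.estimators_))

# Predictions come back in the same indicator layout as the target
y_pred = ovr.predict(X)
print(y_pred.shape)
```

Because each column is decided by its own binary classifier, a predicted row may contain no 1's or several 1's, even though every training row here has exactly one.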

Multioutput classification

What if you want to classify a particular SO question according to two different criteria simultaneously, for instance language and application? In this case you intend to do multioutput classification. For the sake of simplicity I will consider only three application classes, namely 'Computer Vision', 'Speech Recognition' and 'Other Application'. The label array of your dataset should be 2-dimensional:

y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])

Again, we need to transform text class labels into numeric labels. As far as I know this functionality is not implemented in scikit-learn yet, so you will need to write your own code. This thread describes some clever ways to do that, but for the purposes of this post the following one-liner should suffice:

y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T

The encoded labels look like this:

In [229]: y_multi
Out[229]: 
array([[1, 0],
       [0, 2],
       [2, 0],
       [3, 1],
       [0, 2],
       [3, 0]], dtype=int64)
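
As an aside, more recent scikit-learn releases include OrdinalEncoder, which encodes every column of a 2-dimensional array in one call. It is documented as a feature transformer rather than a target encoder, so treat the following as a sketch of an alternative under that assumption, not as the recommended way to encode targets:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])

# dtype=np.int64 requests integer codes instead of the default floats
enc = OrdinalEncoder(dtype=np.int64)
y_multi = enc.fit_transform(y2)
print(y_multi)

# Unlike a single reused LabelEncoder, the fitted encoder keeps
# one array of categories per column
for column_categories in enc.categories_:
    print(column_categories)
```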

And the meaning of the values in each column can be inferred from the following arrays:

In [230]: le.fit(y2[:, 0]).classes_
Out[230]: 
array(['C++', 'Java', 'Other language', 'Python'], 
      dtype='|S18')

In [231]: le.fit(y2[:, 1]).classes_
Out[231]: 
array(['Computer Vision', 'Other Application', 'Speech Recognition'], 
      dtype='|S18')
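
A target encoded this way, with one multiclass column per criterion, is what MultiOutputClassifier is meant for. The sketch below makes the same assumptions as before: X is random, made-up data in place of real question features, and LogisticRegression is just one possible base estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Encoded labels from above: column 0 = language, column 1 = application
y_multi = np.array([[1, 0],
                    [0, 2],
                    [2, 0],
                    [3, 1],
                    [0, 2],
                    [3, 0]])

# Hypothetical features: 6 questions x 5 made-up numeric features
rng = np.random.RandomState(0)
X = rng.rand(6, 5)

clf = MultiOutputClassifier(LogisticRegression())
clf.fit(X, y_multi)

# One multiclass classifier is fitted per target column
print(len(clf.estimators_))
for est in clf.estimators_:
    print(est.classes_)

# Each prediction row holds one class per target
y_pred = clf.predict(X)
print(y_pred.shape)
```

Here a "target" is a whole column of y_multi, and each column is its own multiclass problem. That is the difference from the indicator matrix used in the multilabel case, where each column is a yes/no decision about a single class.
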
2
votes

This is an extension of @tonechas's answer; read that answer before reading this one. OneVsRestClassifier (OVR) supports multilabel targets only when each label is binary (also called binary multi-label), i.e. each sample either belongs to that label or doesn't. It will not work when the target is multioutput (also called multi-class multi-label), i.e. when each label has several classes and each sample belongs to exactly one class within each label. For the latter case, you need to use sklearn's MultiOutputClassifier.

In other words, sklearn's OVR does not work when your target variable looks like this:

y_true = np.array([[2, 1, 0],
                   [0, 2, 1],
                   [1, 2, 4]])

where label1 has 4 classes [0, 1, 2, 3]; label2 has 3 classes [0, 1, 2]; label3 has 5 classes [0, 1, 2, 3, 4]. For example, the first sample belongs to class 2 in label1, class 1 in label2, and class 0 in label3. Think of it as the labels NOT being mutually exclusive, while the classes within each label are mutually exclusive.

Sklearn's OVR will work when the target looks like this:

y_true = np.array([[0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 0]])

where label1, label2 and label3 have only 2 classes each, so a sample either belongs to a label or doesn't. For example, the first sample belongs to label2 and label3.
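
If you are unsure which of the two cases a given target falls into, scikit-learn's type_of_target helper reports how the library interprets it. A small sketch using the two arrays above (the variable names are mine):

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Several classes per label: the multioutput case
y_multioutput = np.array([[2, 1, 0],
                          [0, 2, 1],
                          [1, 2, 4]])

# Two classes (0/1) per label: the binary multilabel case
y_multilabel = np.array([[0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 0]])

print(type_of_target(y_multioutput))  # 'multiclass-multioutput'
print(type_of_target(y_multilabel))   # 'multilabel-indicator'
```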

I am sorry I couldn't find a real-world example for this kind of a usecase.