15 votes

From the documentation, I read that a dummy classifier can be used as a baseline to compare against a classification algorithm.

This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.

What does the dummy classifier do when it uses the stratified approach? I know that the documentation says that it:

generates predictions by respecting the training set’s class distribution.

Could anybody give me a more theoretical explanation of why this is a useful baseline for the performance of a classifier?


3 Answers

25 votes

The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.

Suppose you wish to determine whether a given object possesses or does not possess a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance of the object possesses the target property gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent method in the documentation you cite.

Because many machine learning tasks attempt to increase the success rate of (e.g.) classification tasks, evaluating the baseline success rate gives a floor that one's classifier should out-perform. In the hypothetical discussed above, you would want your classifier to get more than 90% accuracy, because 90% is the success rate available to even "dummy" classifiers.

If one trains a dummy classifier with the stratified parameter using the data discussed above, that classifier guesses at random for each object it encounters, predicting the target property with 90% probability and its absence with 10% probability. This is different from training a dummy classifier with the most_frequent parameter, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:

from sklearn.dummy import DummyClassifier
import numpy as np

two_dimensional_values = []
class_labels           = []

for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)

for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)

# now 90% of the training data contains the target property
X = np.array(two_dimensional_values)
y = np.array(class_labels)

# train a dummy classifier to make predictions based on the most_frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(X, y)

# this produces 100 predictions that say "1"
for i in two_dimensional_values:
    print(dummy_classifier.predict([i]))

# train a dummy classifier to make predictions based on the class values
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit(X, y)

# this produces roughly 90 guesses that say "1" and roughly 10 guesses that say "0"
for i in two_dimensional_values:
    print(new_dummy_classifier.predict([i]))
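
One consequence worth spelling out: because the stratified strategy guesses at random according to the class priors, its expected accuracy is p^2 + (1-p)^2 (here 0.9*0.9 + 0.1*0.1 = 0.82), which is actually lower than the 0.90 that most_frequent achieves. Here's a minimal sketch that checks this empirically; the variable names are mine, not part of the original example:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# 90 positive and 10 negative samples, as in the example above
X = np.array([[1, 1]] * 90 + [[0, 0]] * 10)
y = np.array([1] * 90 + [0] * 10)

# most_frequent always predicts the majority class -> accuracy 0.90
most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, most_frequent.predict(X)))   # 0.9

# stratified guesses randomly with P(1) = 0.9 and P(0) = 0.1,
# so its accuracy hovers around 0.9*0.9 + 0.1*0.1 = 0.82
stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
print(accuracy_score(y, stratified.predict(X)))      # roughly 0.82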
2 votes

A major motivation for the dummy classifier is the F-score when the positive class is in the minority (i.e. imbalanced classes). The dummy classifier is used as a sanity check for an actual classifier; it completely ignores the input data. With the 'most_frequent' strategy, it simply predicts the label that occurs most often in the training set.
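
To make the F-score point concrete: on a heavily imbalanced dataset, a most_frequent dummy can post a high accuracy while scoring an F1 of zero on the minority positive class, which is exactly the sanity check this answer describes. A minimal sketch with made-up data for illustration:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# synthetic imbalanced data: 95 negatives, 5 positives
X = np.zeros((100, 2))
y = np.array([0] * 95 + [1] * 5)

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))          # 0.95 -- looks great
print(f1_score(y, pred, pos_label=1))   # 0.0  -- useless on the minority class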

-1 votes

Using the docs: to illustrate DummyClassifier, first let's create an imbalanced dataset:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, let’s compare the accuracy of SVC and most_frequent:

>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test) 
0.63...

>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)  
0.57...

We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:

>>> clf = SVC(gamma='scale', kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)  
0.97...

We see that the accuracy was boosted to almost 100%, so the RBF-kernel SVC clearly beats the dummy baseline.
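
For completeness, you can also score a stratified dummy on the same split as a second baseline. This is a sketch I'm adding, assuming the X_train/X_test/y_train/y_test variables from the split above are still in scope:

from sklearn.dummy import DummyClassifier

# reuse X_train, X_test, y_train, y_test from the split above
stratified = DummyClassifier(strategy='stratified', random_state=0)
stratified.fit(X_train, y_train)

# random guessing that follows the roughly 2:1 class ratio; the exact score
# moves with random_state but should stay in the rough vicinity of
# (2/3)**2 + (1/3)**2, about 0.56 -- far below the RBF-kernel SVC above
print(stratified.score(X_test, y_test))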