
I am new to machine learning, and I am currently trying to build an ROC curve in Python 3.4 for a Naive Bayes classifier. The actual code of the classifier is given here:

from __future__ import division
from collections import defaultdict
from math import log

def train(samples):
    classes, freq = defaultdict(lambda:0), defaultdict(lambda:0)
    for feats, label in samples:
        classes[label] += 1                 # count classes frequencies
        for feat in feats:
            freq[label, feat] += 1          # count features frequencies

    for label, feat in freq:                # normalize features frequencies
        freq[label, feat] /= classes[label]
    for c in classes:                       # normalize classes frequencies
        classes[c] /= len(samples)

    return classes, freq                    # return P(C) and P(O|C)

def classify(classifier, feats):
    classes, prob = classifier
    return min(classes.keys(),              # calculate argmin of -log(P(C|O))
        key = lambda cl: -log(classes[cl]) + \
            sum(-log(prob.get((cl,feat), 10**(-7))) for feat in feats))

For example, suppose I have some data containing names associated with a gender, and I want to apply my classifier to this kind of data to predict the gender for a given name. Here is a bit more code:

def get_features(sample): return (sample[-1],) # get last letter

samples = (line.split() for line in open('names.txt'))
features = [(get_features(feat), label) for feat, label in samples]
classifier = train(features)

print('gender:', classify(classifier, get_features('Mary')))

OK, so I am stuck on building the ROC curve here. Maybe this is because I misunderstand some basic concepts of classifiers, and honestly I am quite frustrated. Using my classifier I can predict the class for a given name as the argmin of -log(P(C|O)), as written in the code above: the classify function, when called, searches for the class whose summed negative log probability over all features of the given name is minimal, exactly as the definition of a Naive Bayes classifier specifies.
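
Written out, the decision rule that classify implements is

$$\hat{C} \;=\; \arg\min_{C}\left(-\log P(C) \;-\; \sum_{i}\log P(f_i \mid C)\right)$$

where P(C) and P(f_i | C) are the normalized frequencies returned by train, with 10**(-7) substituted for unseen (class, feature) pairs.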

Next, I want to build an ROC curve for this classifier, but the problem is that my classify function returns a binary outcome: it just reports the predicted gender for a given person by computing the argmin, as I said before. To plot an ROC curve I need some kind of score that can be compared against a threshold, something I can sweep up and down in order to get several (TPR, FPR) points.

Please help me clear up this unfortunate misunderstanding so that I can build my ROC curve.


1 Answer


A receiver operating characteristic (ROC) curve illustrates the performance of a binary (two-class) classifier. So you have to restrict yourself to two classes (X vs. Y, or X vs. not X) for a single curve (but you can repeat the procedure to generate curves for other pairs of classes).

Instead of finding the class C for which -log prob(C|O) is minimal, you would use the value prob(C1|O) itself as a score (assuming the values are normalized the same way across the rows of your data sample).
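
To make that concrete, here is a minimal sketch built on top of your train output (score and posterior are helper names I am making up here; they reuse the 10**(-7) fallback from your classify):

from math import log, exp

def score(classifier, feats, cl):
    # unnormalized log-posterior: log P(cl) + sum_i log P(f_i | cl)
    classes, prob = classifier
    return log(classes[cl]) + sum(log(prob.get((cl, feat), 10**(-7)))
                                  for feat in feats)

def posterior(classifier, feats, c1):
    # approximate prob(c1|O) by normalizing the scores over all classes
    classes, _ = classifier
    scores = {cl: score(classifier, feats, cl) for cl in classes}
    m = max(scores.values())                  # subtract the max for numerical stability
    exps = {cl: exp(s - m) for cl, s in scores.items()}
    return exps[c1] / sum(exps.values())

posterior(classifier, get_features('Mary'), c1) then gives you a number between 0 and 1 for that row.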

You can then scan over a threshold value t and decide to classify a row as belonging to class C1 if prob(C1|O) >= t.

For each t you can then calculate (a sketch implementing this scan follows the list):

  • the true positive rate: the fraction of rows actually in class C1 that get classified as C1 because prob(C1|O) >= t
  • the false positive rate: the fraction of rows NOT in class C1 that nevertheless get classified as C1 because prob(C1|O) >= t
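
Here is a rough sketch of that scan (posterior is the hypothetical helper from above; labeled_samples is a list of (features, label) pairs like the one you feed to train, and c1 is whichever label you treat as the positive class):

def roc_points(classifier, labeled_samples, c1):
    # return a list of (FPR, TPR) points, one per candidate threshold;
    # assumes both classes actually occur in labeled_samples
    scored = [(posterior(classifier, feats, c1), label == c1)
              for feats, label in labeled_samples]
    pos = sum(1 for _, is_pos in scored if is_pos)
    neg = len(scored) - pos
    points = []
    for t in sorted({s for s, _ in scored}, reverse=True):   # observed scores as thresholds
        tp = sum(1 for s, is_pos in scored if s >= t and is_pos)
        fp = sum(1 for s, is_pos in scored if s >= t and not is_pos)
        points.append((fp / neg, tp / pos))
    return points

Plotting these points with FPR on the x-axis and TPR on the y-axis (plus the trivial (0, 0) and (1, 1) endpoints) gives the ROC curve.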

In practice, you only need to test, as candidate values of t, the values of prob(C1|O) that you actually get on the rows of your data sample (in your example, with a single last-letter feature, that is roughly one distinct value per distinct last letter).
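
If scikit-learn is available to you, you can also cross-check the hand-rolled curve: sklearn.metrics.roc_curve performs exactly this scan, given the true labels and the per-row scores (the snippet below reuses the hypothetical labeled_samples, posterior, and c1 names from the sketches above):

from sklearn.metrics import roc_curve

y_true  = [label for feats, label in labeled_samples]                 # true label per row
y_score = [posterior(classifier, feats, c1) for feats, _ in labeled_samples]  # score per row
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=c1)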