I am new to machine learning, and I am trying to implement a ROC curve in Python 3.4 for a Naive Bayes classifier. The code of the classifier is given below:
from __future__ import division
from collections import defaultdict
from math import log

def train(samples):
    classes, freq = defaultdict(lambda: 0), defaultdict(lambda: 0)
    for feats, label in samples:
        classes[label] += 1                 # count class frequencies
        for feat in feats:
            freq[label, feat] += 1          # count feature frequencies
    for label, feat in freq:                # normalize feature frequencies
        freq[label, feat] /= classes[label]
    for c in classes:                       # normalize class frequencies
        classes[c] /= len(samples)
    return classes, freq                    # return P(C) and P(O|C)

def classify(classifier, feats):
    classes, prob = classifier
    return min(classes.keys(),              # calculate argmin of -log P(C|O)
               key=lambda cl: -log(classes[cl]) +
                   sum(-log(prob.get((cl, feat), 10**(-7))) for feat in feats))
For example, suppose I have some data containing names paired with genders, and I want to apply my classifier to this data to predict the gender for a given name. Here is some more code:
def get_features(sample):
    return (sample[-1],)                    # use the last letter as the only feature

with open('names.txt') as f:                # each line: a name followed by a gender label
    samples = [line.split() for line in f]
features = [(get_features(name), label) for name, label in samples]
classifier = train(features)
print('gender:', classify(classifier, get_features('Mary')))
I am now stuck on building the ROC curve. Maybe this is because I misunderstand some basic concepts of classifiers; honestly, I am quite discouraged.
Using my classifier I can predict the class for a given name as the argmin of -log P(C|O), as written in the code above. When called, the classify function searches for the class for which the sum of negative logarithms over all features of the given name is smallest, which is exactly what the definition of the Naive Bayes classifier specifies.
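Written out, the rule that classify implements is, as far as I understand it, the following. The notation o_1, ..., o_n for the features of a name is introduced here only for illustration:

```latex
\hat{C} = \arg\min_{C} \Bigl( -\log P(C) - \sum_{i=1}^{n} \log P(o_i \mid C) \Bigr)
        = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(o_i \mid C)
```

The quantity inside the argmin is proportional to the posterior only up to a normalizing constant, which does not affect which class wins.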
Next, I want to build a ROC curve for this classifier. The problem is that my classify function returns only a single label, the predicted gender, obtained by taking the argmin as described above.
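One possible way around this, sketched here under some assumptions, would be to return a continuous score instead of a label: the per-class log values can be exponentiated and normalized into an estimate of P(C|O). The function name `posterior` and the hand-built `toy_classifier` are hypothetical and exist only for illustration; the `(classes, prob)` layout and the `10**(-7)` fallback are taken from the code above.

```python
from math import exp, log

def posterior(classifier, feats, positive):
    """Estimate P(positive | feats) under the naive Bayes model."""
    classes, prob = classifier
    # Unnormalized log-posterior, log P(C) + sum of log P(O|C), for every class.
    log_scores = {
        cl: log(classes[cl])
            + sum(log(prob.get((cl, feat), 10 ** (-7))) for feat in feats)
        for cl in classes
    }
    # Subtract the maximum before exponentiating to reduce the risk of underflow.
    top = max(log_scores.values())
    weights = {cl: exp(score - top) for cl, score in log_scores.items()}
    # Normalize so that the scores over all classes sum to 1.
    return weights[positive] / sum(weights.values())

# Hypothetical classifier in the same (P(C), P(O|C)) format that train() returns.
toy_classifier = (
    {'female': 0.5, 'male': 0.5},
    {('female', 'a'): 0.8, ('male', 'a'): 0.2},
)
print(posterior(toy_classifier, ('a',), 'female'))
```

Unlike the argmin, this value changes smoothly with the evidence, so it can be compared against an adjustable cutoff.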
I need some kind of threshold value to compare against the classifier's output, something I can raise or lower in order to get several (TPR, FPR) points for the ROC curve.
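To make the idea concrete, here is a minimal sketch of such a threshold sweep. It assumes every sample already has a numeric score for the positive class along with its true label; `roc_points`, `auc`, and the toy `scores` and `labels` are hypothetical names and invented data, not part of my classifier.

```python
def roc_points(scores, labels, positive):
    """Return a list of (FPR, TPR) points, one per distinct threshold."""
    n_pos = sum(1 for label in labels if label == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError('need at least one positive and one negative sample')
    # A threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; predict positive when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == positive)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = ['female', 'female', 'male', 'female', 'male']
points = roc_points(scores, labels, 'female')
print(points)
print(auc(points))
```

Each distinct score acts as one threshold, so the curve starts at (0, 0) and ends at (1, 1); the resulting points could then be handed to any plotting library.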
Please help me clear up this misunderstanding so that I can build my ROC curve.