
I'm using Gaussian Naive Bayes to train a model from a Pandas data frame, but I'm getting an error when using precision_recall_curve. The documentation says precision_recall_curve takes the predicted probabilities as input (at least as I read it), so I would expect the code below to work (xtrain and xtest are Pandas data frames with 736 and 184 rows respectively; ytrain and ytest are Series with 736 and 184 rows respectively):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_curve

nb = GaussianNB()
nb.fit(xtrain, ytrain)
predicted = nb.predict_proba(xtest)  # class probabilities, one column per class
precision, recall, threshold = precision_recall_curve(ytest, predicted)

I expect the above to work; however, I receive "IndexError: index 230 is out of bounds for size 184". If I instead do:

predicted = nb.predict(xtest)  # predicted class labels rather than probabilities
precision, recall, threshold = precision_recall_curve(ytest, predicted)

then it executes properly. 184 is the number of rows in xtest and ytest, but 230 does not match any dimension of those structures. Can someone explain the difference, or how I'm supposed to be using precision_recall_curve for this purpose?
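(A quick way to see the difference between the two inputs is to compare shapes; this sketch assumes the same nb, xtest, and ytest as above.)

proba = nb.predict_proba(xtest)   # 2-D array: one column of probabilities per class
labels = nb.predict(xtest)        # 1-D array: one predicted class label per row

print(proba.shape)    # (184, 2) for a binary problem with 184 test rows
print(labels.shape)   # (184,)
print(ytest.shape)    # (184,)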

I don't know where the 230 comes from, but you should really not use scikit-learn estimators on Pandas data structures. scikit-learn expects NumPy conventions, and Pandas violates some of those (e.g. by turning 1-d arrays into column vectors instead of row vectors). Also, is this a binary classification task? - Fred Foo
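If you want to follow that advice and hand scikit-learn plain NumPy arrays instead of Pandas objects, a minimal sketch (assuming the same xtrain, ytrain, and xtest as in the question) would be:

import numpy as np

nb = GaussianNB()
nb.fit(xtrain.values, np.asarray(ytrain))   # .values gives the underlying NumPy array
predicted = nb.predict_proba(xtest.values)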

1 Answer


If this is a binary classification problem, try the following. predict_proba returns an array of shape (n_samples, 2), with one column of probabilities per class, while precision_recall_curve expects a single 1-D array of scores for the positive class, so pass only the second column. (That is most likely where the out-of-range index comes from: the 184 x 2 array is treated as 368 separate scores, producing indices larger than the 184 labels in ytest.)

predicted = nb.predict_proba(xtest)
precision, recall, threshold = precision_recall_curve(ytest, predicted[:,1])  # probabilities for the positive class
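For completeness, here is a self-contained sketch of the whole flow on synthetic binary data (make_classification simply stands in for your DataFrame; the sizes are chosen to match the 184-row test set from the question):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_curve

# Synthetic binary data standing in for the real DataFrame (920 * 0.2 = 184 test rows).
X, y = make_classification(n_samples=920, n_features=10, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

nb = GaussianNB()
nb.fit(xtrain, ytrain)

# Take only the positive-class column before computing the curve.
probs = nb.predict_proba(xtest)[:, 1]
precision, recall, threshold = precision_recall_curve(ytest, probs)

print(probs.shape)                                       # (184,)
print(precision.shape, recall.shape, threshold.shape)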