I'm a new Python user and have been running a Naive Bayes classifier model using the scikit-learn module. Is the following example code on the scikit-learn Naive Bayes documentation page correct?

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))

Shouldn't the gnb.fit() function instead read:

y_pred = gnb.fit(iris.data.drop(columns=['target']), iris.target).predict(iris.data)

That is, the response variable needs to be manually removed from the predictor dataset. I was getting unreasonably high accuracy metrics for my model when a colleague pointed out that the code I had cribbed from the scikit-learn documentation page is wrong.

1 Answer

iris.data is not a DataFrame; it's just a NumPy array of shape (150, 4) holding the 4 features.

iris.target is another numpy array with just the target class.

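I'm not sure how you could call drop on the array. I just checked, and I get a NumPy array rather than a pandas DataFrame, which makes sense because scikit-learn doesn't depend on pandas. Since the target was never part of iris.data, there is nothing to remove, and the documentation example is fine as written.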
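
If you want to verify this yourself, here is a minimal sketch, assuming a default scikit-learn install and calling load_iris() with no extra arguments. The variable name has_drop is only for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Features and labels are stored separately, as plain NumPy arrays.
print(type(iris.data), iris.data.shape)      # features only
print(type(iris.target), iris.target.shape)  # class labels only
print(iris.feature_names)                    # names of the feature columns

# NumPy arrays have no drop() method, so calling iris.data.drop(...)
# would raise AttributeError instead of silently removing a column.
has_drop = hasattr(iris.data, "drop")
print(has_drop)
```

If the last line prints False, then the drop(columns=['target']) call from the question cannot have run against this dataset, so the unusually high accuracy in your own model probably comes from how your own data was prepared.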