
I am working on an intrusion classification problem using the NSL-KDD dataset. I trained on 10 features (out of 42), selected by recursive feature elimination with a Random Forest classifier as the estimator and the Gini index as the splitting criterion for the decision trees. After training the classifier, I use the same classifier to predict the classes of the test data. The cross-validation scores (accuracy, precision, recall, F-score) from sklearn's cross_val_score were all above 99 %. But the confusion matrix shows otherwise, with large counts of false positives and false negatives. Clearly, these do not match the accuracy and the other scores. What did I do wrong?

import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The training set contains X_train (DataFrame of features) and Y_train
# (Series of target labels)
# The test set contains X_test and Y_test

# Classifier variable
clf = RandomForestClassifier(n_estimators=10, criterion='gini')

# Training
clf.fit(X_train, Y_train)

# Testing
Y_pred = clf.predict(X_test)
pandas.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'])

# Scoring
accuracy = cross_val_score(clf, X_test, Y_test, cv=10, scoring='accuracy')
print("Accuracy: %0.5f (+/- %0.5f)" % (accuracy.mean(), accuracy.std() * 2))
precision = cross_val_score(clf, X_test, Y_test, cv=10, scoring='precision_weighted')
print("Precision: %0.5f (+/- %0.5f)" % (precision.mean(), precision.std() * 2))
recall = cross_val_score(clf, X_test, Y_test, cv=10, scoring='recall_weighted')
print("Recall: %0.5f (+/- %0.5f)" % (recall.mean(), recall.std() * 2))
f = cross_val_score(clf, X_test, Y_test, cv=10, scoring='f1_weighted')
print("F-Score: %0.5f (+/- %0.5f)" % (f.mean(), f.std() * 2))

I got accuracy, precision, recall and f-score of

Accuracy 0.99825 
Precision 0.99826
Recall 0.99825
F-Score 0.99825

However, the confusion matrix showed otherwise

               Predicted
Actual      9670      41
            5113    2347

Am I training the whole thing wrong, or is this just a misclassification problem caused by poor feature selection?


2 Answers

2
votes

Your predicted values are stored in Y_pred, so you can score them directly against the true test labels:

accuracy_score(Y_test, Y_pred)

Check whether this agrees with the confusion matrix.
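To make that check concrete, here is a minimal sketch that rebuilds label arrays from the four counts in the question's confusion matrix and scores them. The class names 0 and 1 are placeholders (the real labels are not shown in the question), and the arrays are reconstructed from counts rather than taken from the actual data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Rebuild label arrays from the counts in the question's confusion matrix.
# The class names 0 and 1 are placeholders; the original labels are not shown.
Y_test = np.array([0] * (9670 + 41) + [1] * (5113 + 2347))
Y_pred = np.array([0] * 9670 + [1] * 41 + [0] * 5113 + [1] * 2347)

matrix = confusion_matrix(Y_test, Y_pred)  # rows: actual, columns: predicted
accuracy = accuracy_score(Y_test, Y_pred)  # correct predictions / all samples

print(matrix)
print("Hold-out accuracy: %0.5f" % accuracy)
```

With these counts the hold-out accuracy is (9670 + 2347) / 17171, roughly 0.70, which is far from the 0.998 reported by cross_val_score. So the two numbers are measuring different things.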

0
votes

You are not comparing equivalent results! For the confusion matrix, you train on (X_train, Y_train) and test on (X_test, Y_test). However, cross_val_score fits the estimator on k-1 folds of (X_test, Y_test) and tests it on the remaining fold of (X_test, Y_test), because cross_val_score does its own cross-validation (with 10 folds here) on the dataset you provide. Check the cross_val_score documentation for more explanation.

So basically, the two evaluations do not fit and test your algorithm on the same data: the cross-validation scores come from models that were both trained and tested on the test set, and never use the model you fitted on the training set. This might explain the inconsistency in the results.
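The difference can be reproduced with a small sketch. The data below is synthetic and deliberately contrived (an assumption for illustration, not NSL-KDD): the class boundary sits at a different place in the training set than in the test set, so a model fitted on the training set does poorly on the test set, while models refitted on folds of the test set do well.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not NSL-KDD): the class boundary
# sits at x = 0 in the training set but at x = 5 in the test set.
X_train = rng.uniform(-2, 2, size=(1000, 1))
Y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.uniform(3, 7, size=(1000, 1))
Y_test = (X_test[:, 0] > 5).astype(int)

clf = RandomForestClassifier(n_estimators=10, criterion="gini", random_state=0)

# Evaluation 1: fit on the training set, score on the held-out test set.
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
holdout_accuracy = accuracy_score(Y_test, Y_pred)
holdout_matrix = confusion_matrix(Y_test, Y_pred)

# Evaluation 2: cross_val_score clones clf and refits it on folds of the
# data it is given, so the model trained above is never used here.
cv_on_test = cross_val_score(clf, X_test, Y_test, cv=10, scoring="accuracy")

# A like-for-like cross-validation estimate uses the training data instead.
cv_on_train = cross_val_score(clf, X_train, Y_train, cv=10, scoring="accuracy")

print("Hold-out accuracy: %0.5f" % holdout_accuracy)
print("CV on test data:   %0.5f" % cv_on_test.mean())
print("CV on train data:  %0.5f" % cv_on_train.mean())
print(holdout_matrix)
```

If the goal is to report how the trained classifier performs on unseen data, the hold-out numbers (accuracy_score, confusion_matrix, or classification_report on Y_test and Y_pred) are the ones to quote. Cross-validation is better kept to the training set, for example for model selection.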