I have a strange problem. My dataset has 4 clusters that I use as classes, and it is imbalanced in the following proportions: 75%, 15%, 7% and 3%. I split it into train and test sets with an 80/20 ratio, then train a KNN classifier with 5 neighbors, which gives me an accuracy of 1.
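from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Stratified 80/20 split; only the first of the 5 generated splits is used
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
train_index, test_index = next(sss.split(X, y))
x_train, y_train = X[train_index], y[train_index]
x_test, y_test = X[test_index], y[test_index]

# KNN with the default n_neighbors=5
KNN_final = KNeighborsClassifier()
KNN_final.fit(x_train, y_train)
y_pred = KNN_final.predict(x_test)

print('Avg. accuracy for all classes:', metrics.accuracy_score(y_test, y_pred))
print('Classification report: \n', metrics.classification_report(y_test, y_pred, digits=2))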
Avg. accuracy for all classes: 1.0
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       140
           1       1.00      1.00      1.00        60
           2       1.00      1.00      1.00       300
           3       1.00      1.00      1.00      1500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
Although a perfect score seems suspicious, I keep going: I get new data and try to classify it with this model. It never predicts the smallest class; those samples are always misclassified as the second-smallest class. So I try to balance the data using the SMOTEENN algorithm from the imbalanced-learn library:
Original dataset shape Counter({3: 7500, 2: 1500, 0: 700, 1: 300})
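from collections import Counter
from imblearn.combine import SMOTEENN
# SMOTE over-sampling followed by Edited Nearest Neighbours cleaning
sme = SMOTEENN(sampling_strategy='all', random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))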
Resampled dataset shape Counter({0: 7500, 1: 7500, 2: 7500, 3: 7500})
Then I do the same thing: I split the data into train and test sets with the same 80/20 ratio and train a new KNeighborsClassifier with 5 neighbors. But the classification report looks even worse now:
Avg. accuracy for all classes: 1.0
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
I don't see what I'm doing wrong. Is there any step I need to perform after resampling the data, other than shuffling and splitting, before training a new classifier? Why is my KNN not seeing 4 classes now?
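
One thing that might help narrow this down: the second report again totals 2000 test samples, whereas 20% of the 30,000 resampled rows would be 6,000. A small sanity check along the following lines could show which labels actually end up in each split. This is only a sketch: `describe_split` and the toy label lists are made up for illustration, and the real call would pass `y_train`, `y_test` and `y_pred` from the code above (NumPy arrays can be converted with `.tolist()` first).

```python
from collections import Counter


def describe_split(y_train, y_test, y_pred):
    """Summarise sizes and label counts so a mismatched split is easy to spot."""
    return {
        "n_train": len(y_train),
        "n_test": len(y_test),
        "train_counts": dict(Counter(y_train)),
        "test_counts": dict(Counter(y_test)),
        "pred_counts": dict(Counter(y_pred)),
        # Labels the model was trained on but that never appear in the test set
        "missing_in_test": sorted(set(y_train) - set(y_test)),
    }


# Toy labels standing in for the real arrays (hypothetical values)
y_train_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_test_demo = [0, 0, 0, 1]
y_pred_demo = [0, 0, 0, 1]

summary = describe_split(y_train_demo, y_test_demo, y_pred_demo)
for key, value in summary.items():
    print(key, value)
```

If `n_test` or `test_counts` do not match what a 20% stratified split of `X_res, y_res` should produce, the split is probably being taken from different arrays than the resampled ones.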