
I have a strange problem. I have a model with 4 classes, and the data is imbalanced with the following proportions: 75%, 15%, 7% and 3%. I split it into train and test with an 80/20 proportion, then I train a KNN with 5 neighbors, which gives me an accuracy of 1.

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# stratified 80/20 split, preserving the class proportions
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# use only the first of the 5 stratified splits
train_index, test_index = next(sss.split(X, y))

x_train, y_train = X[train_index], y[train_index]
x_test, y_test = X[test_index], y[test_index]

# n_neighbors defaults to 5
KNN_final = KNeighborsClassifier()
KNN_final.fit(x_train, y_train)

y_pred = KNN_final.predict(x_test)

print('Avg. accuracy for all classes:', metrics.accuracy_score(y_test, y_pred))
print('Classification report: \n', metrics.classification_report(y_test, y_pred, digits=2))

Avg. accuracy for all classes: 1.0
Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       140
           1       1.00      1.00      1.00        60
           2       1.00      1.00      1.00       300
           3       1.00      1.00      1.00      1500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000

Although it seems strange, I keep going: I get new data and try to classify it with this model, but it never finds the smallest class; it always misclassifies it as the second smallest class. So I try to balance the data using the imbalanced-learn library with the SMOTEENN algorithm:

Original dataset shape Counter({3: 7500, 2: 1500, 0: 700, 1: 300})

from collections import Counter
from imblearn.combine import SMOTEENN

# over-sample with SMOTE and clean with ENN, targeting all classes
sme = SMOTEENN(sampling_strategy='all', random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

Resampled dataset shape Counter({0: 7500, 1: 7500, 2: 7500, 3: 7500})

Then I do the same thing: I split the resampled data into train and test with the same 80/20 proportion and train a new KNN classifier with 5 neighbors (a rough sketch of this step follows the report below). But the classification report seems even worse now:

Avg. accuracy for all classes: 1.0
Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      1500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
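
Roughly, this second round looks like the following (the variable names for the resampled split are illustrative, the settings are the same as before):

# illustrative sketch of the second round, run on the resampled data
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
train_index, test_index = next(sss.split(X_res, y_res))

x_train_res, y_train_res = X_res[train_index], y_res[train_index]
x_test_res, y_test_res = X_res[test_index], y_res[test_index]

KNN_res = KNeighborsClassifier()  # n_neighbors=5 by default
KNN_res.fit(x_train_res, y_train_res)

y_pred_res = KNN_res.predict(x_test_res)
print('Avg. accuracy for all classes:', metrics.accuracy_score(y_test_res, y_pred_res))
print('Classification report: \n', metrics.classification_report(y_test_res, y_pred_res, digits=2))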

I don't see what I'm doing wrong. Is there any step I need to apply to the resampled data, other than splitting and shuffling, before training a new classifier? Why is my KNN not seeing 4 classes now?

1 Answer


Although a full investigation requires your data, which you do not provide, such behavior is (at least partially) consistent with the following scenario:

  1. You have duplicates (possibly a lot) in your initial data
  2. Due to these duplicates, some (most? all?) of your test data are actually not new/unseen, but copies of samples in your training data, which leads to an unreasonably high test accuracy of 1.0
  3. When adding new data (with no duplicates of your initial samples), the model unsurprisingly fails to live up to the expectations created by such a high accuracy (1.0) on the test data.

Notice that the stratified split will not protect you from such a scenario; here is a demonstration with toy data, adapted from the documentation:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# only two distinct samples, each repeated three times
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
train_index, test_index = next(sss.split(X, y))

X[train_index]
# result:
array([[3, 4],
       [1, 2],
       [3, 4]])

X[test_index]
# result:
array([[3, 4],
       [1, 2],
       [1, 2]])
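
Both distinct samples end up in the training set and in the test set, so the test set contains nothing unseen. If you want to check whether this is happening in your own data, you could count the exact duplicate rows in X before splitting; one illustrative way with numpy:

import numpy as np

# number of rows in X that are exact copies of another row;
# a large count makes the scenario above very likely
n_unique = np.unique(X, axis=0).shape[0]
print('duplicate rows:', X.shape[0] - n_unique)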