
I am new to machine learning and currently working on a project with imbalanced data. I want to balance the data using random undersampling. I am confused whether I should do the undersampling after the train-test split, or do the undersampling first and then the train-test split.

My approach:

  1. I used train-test split to get X_train, y_train for training and X_test, y_test for testing.
  2. I combined X_train and y_train into one data set and did the undersampling.
  3. After undersampling, I performed cross-validation and model selection based on the F1 score, using X_test, y_test for prediction.

Is my approach correct? Please correct me if I am wrong.

Can you provide the ratio of the classes? Also, the total number of samples? – Mehul Gupta
It appears both orders of operations could make sense for their respective problems. Can you tell us some more about what problem you are trying to solve? That puts a constraint on your assumptions. – hyiltiz
Class 0: 50140, Class 1: 4668. I want to keep a test set that is not a subset of the undersampled data, for checking the accuracy of the model. This is the reason I want to do the train-test split first, so that I can do the undersampling on the train data and check the accuracy using the test data. – sarika

1 Answer


Let's go through your approach:

I used train-test split to get X_train, y_train for training and X_test, y_test for testing. I combined X_train and y_train into one data set and did the undersampling.

That's right. Any resampling technique should be applied only to the train set. This ensures that the test set reflects reality, and the performance measured on it will be a good estimate of your model's ability to generalize. If the resampling is performed on the whole dataset, information leaks between train and test, and your performance estimate will be overly optimistic.
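For concreteness, here is a minimal sketch of that split-then-undersample order, assuming a feature matrix X and label vector y, and using imbalanced-learn's RandomUnderSampler (any resampler would follow the same pattern):

```python
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# X, y = your feature matrix and labels (e.g. ~50140 vs ~4668 samples per class)

# 1. Split first; stratify so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 2. Undersample the training data only; the test set stays untouched.
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)
```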

After undersampling, I performed cross-validation and model selection based on the F1 score.

It's difficult to tell exactly what was done without the code, but it sounds like you ran cross-validation on the already-resampled train data. That's wrong: the undersampling should be applied to the training folds within each cross-validation iteration, never to the validation fold. Here is how 3-fold CV should be done (a code sketch follows the list):

  1. The train set is divided into 3 folds: 2 folds are used for training, 1 for validation.
  2. You apply the resampling to these 2 training folds, train your model, and then estimate the performance on the untouched fold.
  3. Repeat steps 1-2 until each fold has served as the validation set.
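A minimal sketch of that fold-wise procedure, assuming X_train and y_train are NumPy arrays from the split above; LogisticRegression is just a stand-in classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.under_sampling import RandomUnderSampler

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    # Undersample only the two training folds.
    rus = RandomUnderSampler(random_state=42)
    X_res, y_res = rus.fit_resample(X_train[train_idx], y_train[train_idx])
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    # Score on the untouched validation fold.
    fold_scores.append(f1_score(y_train[val_idx], model.predict(X_train[val_idx])))

print("mean F1 across folds:", np.mean(fold_scores))
```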

Thus, what you should do is the following (a combined sketch follows the list):

  1. Split the data into train and test sets.
  2. Perform CV on your train set, applying the undersampling only to the training folds inside each CV iteration.
  3. After the model has been chosen with the help of CV, undersample your train set and train the classifier.
  4. Estimate the performance on the untouched test set.
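Putting the four steps together, one way to avoid doing the fold-wise resampling by hand is imbalanced-learn's Pipeline, which applies the sampler only at fit time, so each CV validation fold stays untouched. A sketch, again with LogisticRegression as a stand-in classifier:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

# The sampler runs only during fit, so CV validation folds are never resampled.
pipe = Pipeline([
    ('under', RandomUnderSampler(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),  # stand-in classifier
])

# Steps 1-2: model selection with CV on the train set only.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_f1 = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
print("CV F1:", cv_f1.mean())

# Steps 3-4: refit on the (undersampled) train set,
# then evaluate once on the untouched test set.
pipe.fit(X_train, y_train)
print("test F1:", f1_score(y_test, pipe.predict(X_test)))
```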