Let's go through your approach:
> I used train_test_split to get X_train, y_train for training and X_test, y_test for testing. I combined X_train and y_train into one dataset and did the undersampling.
That's right. Any resampling technique should be applied only to the train set. This ensures that the test set reflects the real class distribution, so the performance measured on it is a good estimate of your model's generalization ability. If the resampling is performed on the whole dataset, your performance estimate will be overly optimistic.
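As a minimal sketch of that split-then-resample order (the toy dataset and the hand-rolled `undersample` helper are illustrative assumptions, not your code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)  # ~10% minority class

# 1. Split first -- the test set must keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Undersample the majority class on the TRAIN set only.
def undersample(X, y, seed=0):
    r = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = r.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    r.shuffle(idx)
    return X[idx], y[idx]

X_res, y_res = undersample(X_train, y_train)
print(np.bincount(y_res))   # balanced classes in the resampled train set
print(np.bincount(y_test))  # test set keeps the original imbalance
```

Note that `stratify=y` keeps the class ratio identical in both splits, which matters precisely because the test set has to mirror reality.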
> After undersampling, I performed cross validation and model selection based on F1.
It's difficult to say exactly what has been done without the code, but it seems you've done the cross validation on already resampled train data. That's wrong: the undersampling should be done on the training folds inside each cross validation iteration, leaving the held-out fold untouched. Here is how 3-fold CV should look:
1. The train set is divided into 3 folds: 2 folds for training, 1 for testing.
2. You apply resampling to the 2 training folds, train your model, and estimate its performance on the untouched third fold.
3. Repeat steps 1-2 until each fold has been used as a test fold once.
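The fold loop above can be sketched as follows (the dataset, the `undersample` helper, and the choice of logistic regression are all illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (rng.random(600) < 0.15).astype(int)  # imbalanced toy labels

def undersample(X, y, seed=0):
    r = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    idx = np.concatenate([minority, r.choice(majority, len(minority), replace=False)])
    return X[idx], y[idx]

scores = []
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Resample ONLY the 2 training folds...
    X_res, y_res = undersample(X[train_idx], y[train_idx])
    model = LogisticRegression().fit(X_res, y_res)
    # ...and score F1 on the untouched held-out fold.
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores))
```

The key point is that `test_idx` never passes through `undersample`, so each fold's score is computed against the original class distribution.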
Thus, what you should do is:
1. Split the data into train and test sets.
2. Perform CV on your train set, applying undersampling only to the training folds within each iteration; evaluate each fold on its untouched test fold.
3. After the model has been chosen with the help of CV, undersample your train set and train the classifier.
4. Estimate the performance on the untouched test set.
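Putting the four steps together end to end (the data, the two candidate models, and the `undersample` helper are illustrative assumptions, not a prescribed setup):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)

def undersample(X, y, seed=0):
    r = np.random.default_rng(seed)
    mino, majo = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    idx = np.concatenate([mino, r.choice(majo, len(mino), replace=False)])
    return X[idx], y[idx]

# 1. Split into train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# 2. Model selection by CV on the train set, undersampling inside each fold.
candidates = {"logreg": LogisticRegression(),
              "tree": DecisionTreeClassifier(random_state=1)}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
cv_f1 = {}
for name, model in candidates.items():
    scores = []
    for tr, va in cv.split(X_tr, y_tr):
        Xr, yr = undersample(X_tr[tr], y_tr[tr])
        scores.append(f1_score(y_tr[va], model.fit(Xr, yr).predict(X_tr[va])))
    cv_f1[name] = np.mean(scores)
best = max(cv_f1, key=cv_f1.get)

# 3. Undersample the full train set and refit the chosen model.
Xr, yr = undersample(X_tr, y_tr)
final = candidates[best].fit(Xr, yr)

# 4. Estimate generalization on the untouched test set.
print(best, f1_score(y_te, final.predict(X_te)))
```

The test set `X_te`, `y_te` is never resampled and never seen during model selection, so the final F1 is an honest estimate.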