
When performing classification (for example, logistic regression) with an imbalanced dataset (e.g., fraud detection), is it best to scale/zscore/standardize the features before over-sampling the minority class, or to balance the classes before scaling features?

Secondly, does the order of these steps affect how features will eventually be interpreted (when using all data, scaled+balanced, to train a final model)?

Here's an example:

Scale first:

  1. Split data into train/test folds
  2. Calculate mean/std using all training (imbalanced) data; scale the training data using these calculations
  3. Oversample minority class in the training data (e.g., using SMOTE)
  4. Fit logistic regression model to training data
  5. Use mean/std calculations to scale the test data
  6. Predict class with imbalanced test data; assess acc/recall/precision/auc

Oversample first:

  1. Split data into train/test folds
  2. Oversample minority class in the training data (e.g., using SMOTE)
  3. Calculate mean/std using balanced training data; scale the training data using these calculations
  4. Fit logistic regression model to training data
  5. Use mean/std calculations to scale the test data
  6. Predict class with imbalanced test data; assess acc/recall/precision/auc
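
The two orderings above can be sketched side by side. This is only a rough illustration under several assumptions: the data is synthetic, scikit-learn is used for scaling and logistic regression, and plain random duplication of minority rows stands in for SMOTE (the helper `random_oversample` and the function `run` are hypothetical names, not part of any library). SMOTE would synthesize new points instead of duplicating them, so exact numbers would differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def random_oversample(X, y, rng):
    """Duplicate minority rows (with replacement) until the classes are balanced.

    Assumption: a simple stand-in for SMOTE, not SMOTE itself.
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    idx = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])


def run(order, X_train, y_train, X_test, y_test, seed=0):
    """Fit one pipeline; return the test AUC and the mean the scaler learned."""
    rng = np.random.default_rng(seed)
    scaler = StandardScaler()
    if order == "scale_first":
        X_tr = scaler.fit_transform(X_train)            # stats from imbalanced data
        X_tr, y_tr = random_oversample(X_tr, y_train, rng)
    elif order == "oversample_first":
        X_tr, y_tr = random_oversample(X_train, y_train, rng)
        X_tr = scaler.fit_transform(X_tr)               # stats from balanced data
    else:
        raise ValueError(f"unknown order: {order}")
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # The test set is only scaled, never resampled.
    scores = model.predict_proba(scaler.transform(X_test))[:, 1]
    return roc_auc_score(y_test, scores), scaler.mean_


# Synthetic imbalanced data: 2000 majority rows, 100 minority rows.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 1.0, size=(2000, 2)),
    data_rng.normal(2.0, 1.5, size=(100, 2)),
])
y = np.concatenate([np.zeros(2000, dtype=int), np.ones(100, dtype=int)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = {
    order: run(order, X_train, y_train, X_test, y_test)
    for order in ("scale_first", "oversample_first")
}
for order, (auc, mean) in results.items():
    print(f"{order}: test AUC = {auc:.3f}, scaler mean = {mean}")
```

In both variants the scaler is fit on training data only and then reused on the untouched test set; the only difference is which version of the training data supplies the mean/std.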

1 Answer


You may have meant it implicitly, but you need to apply the mean/std to scale the training data as well, and that needs to happen before you fit the model.

Barring that point, there isn't a definitive answer on this. The most reliable approach is to try both orderings and compare them on held-out data to see which works best for your problem.

For your own understanding of the model on the resulting data, you may want to compute the mean and standard deviation of the minority and majority classes separately. If the two classes have similar statistics, we wouldn't expect much of a difference between scaling first and over-sampling first.

If the means and standard deviations are very different, the results may differ significantly. But that may also mean the problem has greater separation, and you may expect a higher classification accuracy.
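
The per-class comparison suggested above might look like the following minimal sketch. The tiny one-feature dataset and the helper name `per_class_stats` are made up for illustration. It also shows why the ordering matters when the classes differ: after balancing, the pooled mean moves toward the unweighted average of the class means (exactly so for duplication to equal class sizes, and presumably only roughly so for SMOTE).

```python
import numpy as np


def per_class_stats(X, y):
    """Return {class_label: (mean, std)}, computed per feature."""
    return {
        int(c): (X[y == c].mean(axis=0), X[y == c].std(axis=0))
        for c in np.unique(y)
    }


# Toy data: four majority rows (class 0), two minority rows (class 1).
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [12.0]])
y = np.array([0, 0, 0, 0, 1, 1])

stats = per_class_stats(X, y)

# Mean a scaler would learn from the imbalanced data (scale first).
imbalanced_mean = X.mean(axis=0)

# Mean a scaler would learn if both classes had equal weight (oversample first).
balanced_mean = np.mean([mean for mean, _ in stats.values()], axis=0)

for label, (mean, std) in stats.items():
    print(f"class {label}: mean = {mean}, std = {std}")
print("pooled mean, imbalanced:", imbalanced_mean)
print("pooled mean, balanced:  ", balanced_mean)
```

If the two pooled means (and the corresponding standard deviations) are close, the two orderings should yield nearly the same scaled features.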