I am working on an imbalanced binary classification problem on a marketing dataset which has:
- a No:Yes ratio of 88:12 (No = didn't buy the product, Yes = bought)
- ~4300 observations and 30 features (9 numeric and 21 categorical)
I divided my data into train (80%) and test (20%) sets, then applied StandardScaler and SMOTE to the train set. SMOTE brought the train set's No:Yes ratio to 1:1. I then ran a logistic regression classifier, as shown in the code below, and got a recall score of 80% on the test data, as opposed to only 21% on the test data with a logistic regression classifier trained without SMOTE.
With SMOTE the recall increase is great, but the false positives are quite high (see the image for the confusion matrix), which is a problem because we will end up targeting many unlikely-to-buy customers. Is there a way to bring down the false positives without sacrificing recall/true positives?
from sklearn.linear_model import LogisticRegression

# Without SMOTE: fit on the original train set
clf_logistic_nosmote = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train, y_train)

# With SMOTE: fit on the scaled, resampled train set
clf_logistic = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train_sc_resampled, y_train_resampled)
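
For context, the preprocessing that produces X_train_sc_resampled / y_train_resampled and the evaluation look roughly like this (a minimal sketch: it assumes the 21 categorical features are already numerically encoded, that X and y hold the full feature matrix and labels, and that the positive label is 1; adjust names to your data):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score, confusion_matrix
from imblearn.over_sampling import SMOTE

# 80/20 split, stratified to preserve the 88:12 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit the scaler on the train set only, apply the same transform to test
scaler = StandardScaler().fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

# Oversample the minority ('Yes') class to a 1:1 ratio, train set only
X_train_sc_resampled, y_train_resampled = SMOTE(random_state=0).fit_resample(
    X_train_sc, y_train)

# Evaluate on the untouched test set
y_pred = clf_logistic.predict(X_test_sc)
print(recall_score(y_test, y_pred, pos_label=1))  # ~0.80 with SMOTE
print(confusion_matrix(y_test, y_pred))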