0 votes

I am working on an imbalanced binary classification marketing dataset which has:

  1. A No:Yes ratio of 88:12 (No = didn't buy the product, Yes = bought)
  2. ~4300 observations and 30 features (9 numeric and 21 categorical)

I divided my data into train (80%) and test (20%) sets, then applied StandardScaler and SMOTE to the train set. SMOTE brought the No:Yes ratio of the train set to 1:1. I then ran a logistic regression classifier, as shown in the code below, and got a recall score of 80% on the test data, as opposed to only 21% when applying logistic regression without SMOTE.
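For reference, the preprocessing flow looks roughly like this (a simplified sketch, assuming X and y are the already-encoded feature matrix and labels; only the resampled variable names appear in my actual code below):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# 80/20 stratified split, preserving the 88:12 class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit the scaler on the train set only; reuse the same transform on test
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# SMOTE oversamples the minority class until the train set is 1:1
X_train_sc_resampled, y_train_resampled = SMOTE(random_state=0).fit_resample(
    X_train_sc, y_train)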

With SMOTE the recall increase is great; however, the false positives are quite high (see the confusion matrix image below), which is a problem because we will end up targeting many unlikely-to-buy customers. Is there a way to bring down the false positives without sacrificing recall/true positives?

[Image: confusion matrix]

from sklearn.linear_model import LogisticRegression

# Without SMOTE: fit on the original training data
clf_logistic_nosmote = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train, y_train)

# With SMOTE: fit on the scaled, resampled training data
clf_logistic = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train_sc_resampled, y_train_resampled)
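For completeness, the recall and confusion matrix I quote come from the held-out test set, computed along these lines (a sketch; X_test_sc is the scaled test set from the split above):

from sklearn.metrics import recall_score, confusion_matrix

# Evaluate the SMOTE model on the untouched, scaled test set
y_pred = clf_logistic.predict(X_test_sc)
print("Recall:", recall_score(y_test, y_pred))

# Rows are actual (No, Yes), columns are predicted (No, Yes);
# the top-right cell holds the false positives I want to reduce
print(confusion_matrix(y_test, y_pred))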
What's the value_counts of each class at the beginning? How many features are you using in total, after feature engineering if that takes place? – Herc01

Total samples = 4334; value_counts: 0 = 3832 (88%), 1 = 502 (12%); total features after feature engineering = 30 (9 numeric, 21 categorical). – Vikrant Arora

1 Answer

0 votes

I had a similar issue where the false positives were very high. In that case I had applied SMOTE after doing the feature engineering.

I then tried applying SMOTE before the feature engineering and used the SMOTE-generated data to extract the features. That way it worked pretty well. It is a slower approach, but it worked out for me. Let me know how it goes for you.
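Roughly, the ordering I mean looks like this (just a sketch: extract_features is a placeholder for whatever feature engineering you do, and it assumes the raw train data is already numerically encoded, since plain SMOTE only handles numeric features; SMOTENC is an option for mixed data):

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Oversample first, on the raw (encoded) training data...
X_train_raw_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train_raw, y_train)

# ...then run the feature engineering on the SMOTE-generated rows
X_train_feats = extract_features(X_train_raw_res)

clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train_feats, y_train_res)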