
Why do predict() and predict_proba() give different roc_auc_score values in Random Forest?

I understand that predict_proba() returns probabilities: in binary classification it gives two probabilities per sample, one for each class, while predict() returns the predicted class itself.
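To make the difference concrete, here is a minimal sketch on synthetic data (not the Titanic columns from the question), showing the shape and content each method returns:

```python
# Toy illustration of predict() vs predict_proba() on a binary problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(clf.predict(X[:3]))        # hard labels, shape (3,), values 0 or 1
print(clf.predict_proba(X[:3]))  # probabilities, shape (3, 2), each row sums to 1
```

The column order of predict_proba follows `clf.classes_`, so column 1 is the probability of class 1.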

    # Using predict_proba()
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    rf = RandomForestClassifier(n_estimators=200, random_state=39)
    rf.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

    # make predictions on train and test set; keep only the probability
    # of the positive class (column 1), which is what roc_auc_score expects
    pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'Sex']])[:, 1]
    pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))[:, 1]

    print('Train set')
    print('Random Forests using predict_proba roc-auc: {}'.format(roc_auc_score(y_train, pred_train)))

    print('Test set')
    print('Random Forests using predict_proba roc-auc: {}'.format(roc_auc_score(y_test, pred_test)))

    # Using predict()
    pred_train = rf.predict(X_train[['Cabin_mapped', 'Sex']])
    pred_test = rf.predict(X_test[['Cabin_mapped', 'Sex']].fillna(0))

    print('Train set')
    print('Random Forests using predict roc-auc: {}'.format(roc_auc_score(y_train, pred_train)))
    print('Test set')
    print('Random Forests using predict roc-auc: {}'.format(roc_auc_score(y_test, pred_test)))

Train set Random Forests using predict_proba roc-auc: 0.8199550985878832

Test set Random Forests using predict_proba roc-auc: 0.8332142857142857

Train set Random Forests using predict roc-auc: 0.7779440793041364

Test set Random Forests using predict roc-auc: 0.7686904761904761


2 Answers

3 votes

As you said, the predict function returns the prediction as a 0/1 (True/False) value, whereas predict_proba returns probabilities, i.e. values between zero and one, and this is the reason for the difference.

AUC means "area under the curve", and that area is indeed different when the curve is built from hard 0/1 predictions rather than from continuous probability scores.

The key point is that predict_proba preserves the ranking of the examples, while predict collapses every score to 0 or 1 at a 0.5 threshold. Imagine two examples: a negative one scored 0.3 and a positive one scored 0.4. The probabilities rank the positive above the negative, so the ROC-AUC is 1.0. With predict, both scores fall below 0.5, both predictions are 0, the ranking information is lost, and the ROC-AUC drops to 0.5.
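A minimal numeric sketch of this (made-up scores, not the question's model):

```python
# Two samples: the positive is ranked above the negative, but both
# fall below the 0.5 threshold that predict() effectively applies.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1])
proba = np.array([0.3, 0.4])         # correct ranking
labels = (proba >= 0.5).astype(int)  # both thresholded to 0

print(roc_auc_score(y_true, proba))   # 1.0
print(roc_auc_score(y_true, labels))  # 0.5 (all ties)
```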

1 vote

predict returns either 0 or 1 as output.

predict_proba returns probabilities for both classes; its second column (`[:, 1]`) is the probability of class 1, which is the score roc_auc_score expects.
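For reference, a short self-contained sketch on synthetic data (names and data are illustrative, not from the question) showing how to pass the positive-class column to roc_auc_score:

```python
# roc_auc_score wants one score per sample, so slice predict_proba's
# positive-class column rather than passing the full (n, 2) array.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X)  # shape (n, 2); column order follows clf.classes_
p_pos = proba[:, 1]           # probability of class 1
print(roc_auc_score(y, p_pos))
```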