I am trying to perform K-Fold Cross Validation and GridSearchCV to optimise my Gradient Boosting model, following this guide: https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the screenshot of the Model Report below:
1) How is the accuracy of 0.814365 calculated? Where in the script does a train/test split happen? If I change cv_folds=5 to any other integer, the accuracy is still 0.814365. In fact, removing cv_folds and passing performCV=False also gives the same accuracy.
(Note: my own scikit-learn model with no CV and an 80/20 train/test split gives an accuracy of around 0.79-0.80; see the sketch after these questions.)
2) Likewise, how is the AUC Score (Train) calculated, and should it be labelled ROC-AUC rather than AUC? My scikit-learn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV Score so much lower than the AUC (Train) score? Both appear to use roc_auc scoring (my scikit-learn model gives 0.77 for the ROC AUC).
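For reference, this is roughly how I compute my own numbers quoted above (a minimal sketch: the 80/20 split and random_state are arbitrary choices of mine, and it assumes the same 123.csv, APPROVED and ID columns as my script below):

import pandas as pd
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("123.csv")
predictors = [x for x in df.columns if x not in ['APPROVED', 'ID']]

# 80/20 hold-out split: the model is scored on rows it never saw during fit()
X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df['APPROVED'], test_size=0.2, random_state=10)

gbm = GradientBoostingClassifier(random_state=10)
gbm.fit(X_train, y_train)

print("Hold-out accuracy: %f" % metrics.accuracy_score(y_test, gbm.predict(X_test)))
print("Hold-out ROC AUC : %f" % metrics.roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1]))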
Here is my script, based on the guide:

import numpy as npy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import cross_validation, metrics  # sklearn < 0.20; newer versions use sklearn.model_selection

df = pd.read_csv("123.csv")
target = 'APPROVED'  # column to predict
IDcol = 'ID'

def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the full dataset
    alg.fit(ddf[predictors], ddf['APPROVED'])

    # Predict on the training set (the same rows used for fitting)
    ddf_predictions = alg.predict(ddf[predictors])
    ddf_predprob = alg.predict_proba(ddf[predictors])[:, 1]

    # Perform cross-validation:
    if performCV:
        cv_score = cross_validation.cross_val_score(alg, ddf[predictors], ddf['APPROVED'],
                                                    cv=cv_folds, scoring='roc_auc')

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %f" % metrics.accuracy_score(ddf['APPROVED'].values, ddf_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(ddf['APPROVED'], ddf_predprob))
    if performCV:
        print("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g" %
              (npy.mean(cv_score), npy.std(cv_score), npy.min(cv_score), npy.max(cv_score)))

    # Print feature importances:
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')

# Choose all predictors except the target & ID columns
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)
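As far as I can tell from the function, these are the three computations behind the report (a minimal sketch reusing df, predictors, metrics and cross_validation from above), which would explain why the first two numbers never move with cv_folds:

# Refit on the full data, exactly as modelfit does internally
gbm0 = GradientBoostingClassifier(random_state=10)
gbm0.fit(df[predictors], df['APPROVED'])

# In-sample metrics: predictions on the same rows used for fitting,
# so no split and no folds are involved
print(metrics.accuracy_score(df['APPROVED'], gbm0.predict(df[predictors])))
print(metrics.roc_auc_score(df['APPROVED'], gbm0.predict_proba(df[predictors])[:, 1]))

# Only this number depends on cv_folds: the mean ROC AUC over hold-out folds
cv_score = cross_validation.cross_val_score(
    gbm0, df[predictors], df['APPROVED'], cv=5, scoring='roc_auc')
print(cv_score.mean())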
Wouldn't changing cv_folds mean that the calculations of accuracy_score and auc would change with those folds and would not remain constant? – vielkind