5
votes

I am trying to use the XGBoost scikit-learn wrapper with early stopping on a regression problem. Strangely, the computation of the early stopping eval_metric (in my case, rmse) fails at every round. This is odd because the same estimator works on the eval_set without early stopping.

Here is the code:

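from xgboost import XGBRegressor

# Hold out the last n_splits rows with a non-missing target as the evaluation set
eval_train_indices = y.dropna()[:-n_splits].index
eval_test_indices = y.dropna()[-n_splits:].index

X_train, X_test = X.loc[eval_train_indices, :], X.loc[eval_test_indices, :]
y_train, y_test = y.loc[eval_train_indices], y.loc[eval_test_indices]

eval_set = [(X_train, y_train), (X_test, y_test)]

# params is a dict of additional hyperparameters
predictor = XGBRegressor(n_estimators=50000, subsample=0.8, **params)

# Fit on the full X and y, monitoring RMSE on both evaluation sets
predictor.fit(X, y,
              eval_metric=["rmse"],
              eval_set=eval_set,
              early_stopping_rounds=40,
              verbose=True)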

And here is the error message it yields:

    <ipython-input-65-358402bfa21c> in fit(self, T)
    147                   early_stopping_rounds=40,
    148                   verbose=True)
    150 
    151         n_estimators=int(self.predictor.best_iteration*1.0)

/Users/Nicolas/anaconda2/lib/python2.7/site-packages/xgboost-0.7-py2.7.egg/xgboost/sklearn.pyc in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model)
    291                               early_stopping_rounds=early_stopping_rounds,
    292                               evals_result=evals_result, obj=obj, feval=feval,
--> 293                               verbose_eval=verbose, xgb_model=xgb_model)
    294 
    295         if evals_result:

/Users/Nicolas/anaconda2/lib/python2.7/site-packages/xgboost-0.7-py2.7.egg/xgboost/training.pyc in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks, learning_rates)
    202                            evals=evals,
    203                            obj=obj, feval=feval,
--> 204                            xgb_model=xgb_model, callbacks=callbacks)
    205 
    206 

/Users/Nicolas/anaconda2/lib/python2.7/site-packages/xgboost-0.7-py2.7.egg/xgboost/training.pyc in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
     97                                end_iteration=num_boost_round,
     98                                rank=rank,
---> 99                                evaluation_result_list=evaluation_result_list))
    100         except EarlyStopException:
    101             break

/Users/Nicolas/anaconda2/lib/python2.7/site-packages/xgboost-0.7-py2.7.egg/xgboost/callback.pyc in callback(env)
    245                                    best_msg=state['best_msg'])
    246         elif env.iteration - best_iteration >= stopping_rounds:
--> 247             best_msg = state['best_msg']
    248             if verbose and env.rank == 0:
    249                 msg = "Stopping. Best iteration:\n{}\n\n"

KeyError: 'best_msg'

For some reason, XGB seems unable to compute the RMSE during the early stopping rounds, although it does work when tested on the eval train and test sets without early stopping. With verbose=True, it prints the following:

[0] validation_0-rmse:nan   validation_1-rmse:nan
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.

Will train until validation_1-rmse hasn't improved in 40 rounds.
[1] validation_0-rmse:nan   validation_1-rmse:nan
[2] validation_0-rmse:nan   validation_1-rmse:nan
[3] validation_0-rmse:nan   validation_1-rmse:nan
[4] validation_0-rmse:nan   validation_1-rmse:nan
[5] validation_0-rmse:nan   validation_1-rmse:nan
[6] validation_0-rmse:nan   validation_1-rmse:nan
[7] validation_0-rmse:nan   validation_1-rmse:nan
[8] validation_0-rmse:nan   validation_1-rmse:nan
[9] validation_0-rmse:nan   validation_1-rmse:nan
[10]    validation_0-rmse:nan   validation_1-rmse:nan
[11]    validation_0-rmse:nan   validation_1-rmse:nan
[12]    validation_0-rmse:nan   validation_1-rmse:nan
[13]    validation_0-rmse:nan   validation_1-rmse:nan
[14]    validation_0-rmse:nan   validation_1-rmse:nan
[15]    validation_0-rmse:nan   validation_1-rmse:nan
[16]    validation_0-rmse:nan   validation_1-rmse:nan
[17]    validation_0-rmse:nan   validation_1-rmse:nan
[18]    validation_0-rmse:nan   validation_1-rmse:nan
[19]    validation_0-rmse:nan   validation_1-rmse:nan
[20]    validation_0-rmse:nan   validation_1-rmse:nan
[21]    validation_0-rmse:nan   validation_1-rmse:nan
[22]    validation_0-rmse:nan   validation_1-rmse:nan
[23]    validation_0-rmse:nan   validation_1-rmse:nan
[24]    validation_0-rmse:nan   validation_1-rmse:nan
[25]    validation_0-rmse:nan   validation_1-rmse:nan
[26]    validation_0-rmse:nan   validation_1-rmse:nan
[27]    validation_0-rmse:nan   validation_1-rmse:nan
[28]    validation_0-rmse:nan   validation_1-rmse:nan
[29]    validation_0-rmse:nan   validation_1-rmse:nan
[30]    validation_0-rmse:nan   validation_1-rmse:nan
[31]    validation_0-rmse:nan   validation_1-rmse:nan
[32]    validation_0-rmse:nan   validation_1-rmse:nan
[33]    validation_0-rmse:nan   validation_1-rmse:nan
[34]    validation_0-rmse:nan   validation_1-rmse:nan
[35]    validation_0-rmse:nan   validation_1-rmse:nan
[36]    validation_0-rmse:nan   validation_1-rmse:nan
[37]    validation_0-rmse:nan   validation_1-rmse:nan
[38]    validation_0-rmse:nan   validation_1-rmse:nan
[39]    validation_0-rmse:nan   validation_1-rmse:nan
[40]    validation_0-rmse:nan   validation_1-rmse:nan

I don't even understand what could cause a failure to compute the RMSE. It may be due to missing values, but there are none in the output when I print predictor.predict(X_test).
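For illustration, here is a minimal sketch of how a single missing label turns a hand-computed RMSE into NaN even when every prediction is finite. The `rmse` helper and the toy values are hypothetical, and this is plain NumPy arithmetic, not XGBoost's internal metric code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error computed directly with NumPy (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy predictions: all finite, as in the predictor.predict(X_test) check above.
y_pred = [1.0, 2.0, 3.0]

clean = rmse([1.0, 2.0, 4.0], y_pred)     # finite value
dirty = rmse([1.0, np.nan, 4.0], y_pred)  # NaN: the missing label propagates through the mean

print(clean, dirty)
```

So checking the predictions alone may not be enough; the labels passed alongside them would need to be checked as well.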

2
The error is caused by missing values in X. XGB can handle missing values in model.fit(X, y) but not in the early stopping rounds, for some unknown reason. - NicolasWoloszko

2 Answers

1
votes

It is due to NaN values; try removing or substituting them and check whether it works.
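As a rough sketch of that suggestion, assuming X is a pandas DataFrame and y a pandas Series as in the question, the missing values can be counted and then either dropped or imputed before calling fit. The toy data and the names `X_clean`, `X_imputed`, and so on are hypothetical, and median imputation is just one possible substitution strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's X and y.
X = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 4.0],
                  "f2": [0.5, 0.1, 0.3, 0.9]})
y = pd.Series([10.0, np.nan, 30.0, 40.0])

# Count missing values per feature and in the target.
print(X.isna().sum())
print(y.isna().sum())

# Option 1: keep only rows where every feature and the target are present.
mask = X.notna().all(axis=1) & y.notna()
X_clean, y_clean = X.loc[mask], y.loc[mask]

# Option 2: keep rows that have a target and fill missing features
# with the column median computed on those rows.
has_target = y.notna()
X_imputed = X.loc[has_target].fillna(X.loc[has_target].median())
y_imputed = y.loc[has_target]
```

The cleaned pair would then be passed to both fit and eval_set in place of the original X and y.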

0
votes

I have had this issue only since upgrading to xgboost=0.80 in order to use the SHAP module. The earlier version, xgboost=0.6a1, runs fine.