I am building a binary classifier with imbalanced classes (ratio 1:10). I tried KNN, random forests, and an XGBoost classifier. Among them, XGBoost gives the best precision-recall tradeoff and F1 score (perhaps because the dataset is quite small: shape (1900, 19)).
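(For context, the comparison was along these lines; this is a sketch, not my exact code, and the model settings shown are illustrative.)

# Sketch of the model comparison (illustrative, not my exact code):
# stratified cross-validated F1 on the imbalanced data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

models = {'KNN': KNeighborsClassifier(),
          'RF': RandomForestClassifier(),
          'XGB': XGBClassifier(objective='binary:logistic', scale_pos_weight=9)}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    f1 = cross_val_score(model, X, y, cv=cv, scoring='f1')
    print(name, round(f1.mean(), 3))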
After checking the error plots for XGBoost, I decided to use RandomizedSearchCV() from sklearn to tune the hyperparameters of my XGBoost classifier. Based on another Stack Exchange answer, this is my code:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import f1_score
import numpy as np

score_arr = []
clf_xgb = XGBClassifier(objective='binary:logistic')
param_dist = {'n_estimators': [50, 120, 180, 240, 400],
              'learning_rate': [0.01, 0.03, 0.05],
              'subsample': [0.5, 0.7],
              'max_depth': [3, 4, 5],
              'min_child_weight': [1, 2, 3],
              'scale_pos_weight': [9]}  # ~9 negatives per positive (1:10 ratio)

clf = RandomizedSearchCV(clf_xgb, param_distributions=param_dist, n_iter=25,
                         scoring='precision', error_score=0, verbose=3, n_jobs=-1)
print(clf)

numFolds = 6
folds = StratifiedKFold(n_splits=numFolds, shuffle=True)

estimators = []
results = np.zeros(len(X_train))
score = 0.0
for train_index, test_index in folds.split(X_train, y_train):
    # index into X_train/y_train, the same data the fold indices refer to
    _X_train, _X_test = X_train.iloc[train_index, :], X_train.iloc[test_index, :]
    _y_train, _y_test = y_train.iloc[train_index].values.ravel(), y_train.iloc[test_index].values.ravel()
    clf.fit(_X_train, _y_train, eval_metric="error", verbose=True)
    estimators.append(clf.best_estimator_)      # best model from this fold's search
    results[test_index] = clf.predict(_X_test)  # out-of-fold predictions
    fold_f1 = f1_score(_y_test, results[test_index])
    score_arr.append(fold_f1)
    score += fold_f1
score /= numFolds
So RandomizedSearchCV selects the classifier, and then within each k-fold iteration it is fit and used to predict on the validation fold. Note that I pass X_train and y_train to the k-fold split, so that I keep a separate test dataset for evaluating the final model.
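The hold-out split itself was along these lines (a sketch; the 0.2 test fraction and random_state are illustrative):

from sklearn.model_selection import train_test_split

# Stratified split so the held-out test set keeps the 1:10 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)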
Now, the problem: if you look at the F1 score in each k-fold iteration, it is score_arr = [0.5416666666666667, 0.4, 0.41379310344827586, 0.5, 0.44, 0.43478260869565216].
But when I evaluate clf.best_estimator_ on my test dataset, it gives an F1 score of 0.80, with precision and recall of {'precision': 0.8688524590163934, 'recall': 0.7571428571428571}.
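Those test-set numbers come from an evaluation along these lines (a sketch of what I ran):

from sklearn.metrics import f1_score, precision_score, recall_score

best_model = clf.best_estimator_     # best estimator from the final fold's search
y_pred = best_model.predict(X_test)  # X_test was never touched during CV
print('f1       :', f1_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))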
How come my validation scores are low while the test-set score is high? Is my model correct, or did I miss something?
P.S. Taking the parameters of clf.best_estimator_, I fitted them separately on my training data using xgb.cv; there too the F1 score was near 0.55. I think this might be due to differences between the training approaches of RandomizedSearchCV and xgb.cv. Please tell me if plots or more information are needed.
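The xgb.cv check looked roughly like this (a sketch; the custom f1_eval metric and the parameter filtering are my reconstruction, and feval is the older xgboost API name, renamed custom_metric in recent versions):

import xgboost as xgb
from sklearn.metrics import f1_score

def f1_eval(preds, dmat):
    # xgb.cv yields probabilities for binary:logistic; threshold at 0.5 for F1
    labels = dmat.get_label()
    return 'f1', f1_score(labels, (preds > 0.5).astype(int))

best = clf.best_estimator_.get_params()
params = {'objective': 'binary:logistic',
          'learning_rate': best['learning_rate'],
          'max_depth': best['max_depth'],
          'min_child_weight': best['min_child_weight'],
          'subsample': best['subsample'],
          'scale_pos_weight': best['scale_pos_weight']}

dtrain = xgb.DMatrix(X_train, label=y_train)
cv_res = xgb.cv(params, dtrain, num_boost_round=best['n_estimators'],
                nfold=6, stratified=True, feval=f1_eval, maximize=True, seed=0)
print(cv_res.tail(1))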
Update: I am attaching error plots of train and test aucpr and classification accuracy for the generated model. The plots were generated by running model.fit() only once (consistent with the values in score_arr).
