1 vote

I'm building a Random Forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified train-test split, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose in order to achieve the most practically useful / best possible random forest classifier model? I plotted the accuracy vs. n_estimators curve using the code snippet below. x_train and y_train are the features and target labels of the training set respectively, and x_test and y_test are the features and target labels of the test set respectively.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

scores = []
for k in range(1, 200):
    # note: no random_state is fixed here, so each forest (and hence each
    # accuracy value) varies from run to run
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))

import matplotlib.pyplot as plt
%matplotlib inline

# plot the relationship between n_estimators and testing accuracy
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')

[Plot: testing accuracy vs. n_estimators]

Here, it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even between nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.

Comments:

- You should choose a value around the point where the performance starts to stabilize on the curve. Don't try to pick one particular value: the differences in performance between two close values of n_estimators come from variability due to randomness and will not replicate on new data. – ThomaS
- Stepwise refinement is one way to find an efficiency improvement. Try using GridSearchCV and cross-validation to find the best parameters; see the sketch after these comments. – Golden Lion
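
A minimal sketch of the cross-validated tuning suggested in the comments, reusing x_train and y_train from the question. The scoring='roc_auc' choice is an assumption here, picked because ROC AUC is usually more informative than plain accuracy under the 78/22 class imbalance:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# grid over n_estimators only; the candidate values are illustrative, not tuned
param_grid = {'n_estimators': [10, 50, 100, 150, 200]}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),  # fixed seed damps run-to-run noise
    param_grid=param_grid,
    scoring='roc_auc',  # assumption: better suited than accuracy to the 78/22 imbalance
    cv=5,
    n_jobs=-1)
grid.fit(x_train, y_train)

print("Best n_estimators:", grid.best_params_['n_estimators'])
print("Best CV ROC AUC:", grid.best_score_)

Cross-validation averages over several splits, so the score for each n_estimators is less noisy than the single train-test evaluation in the question.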

3 Answers

0 votes

See the RandomizedSearchCV example at https://github.com/dnishimoto/python-deep-learning/blob/master/Random%20Forest%20Tennis.ipynb

I used RandomizedSearchCV to find the best params for the Random Forest classifier.

n_estimators is the number of decision trees to use.

Try using XGBoost to get more accuracy (a minimal sketch follows the code below).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV

parameter_grid = {
    'n_estimators': [1, 2, 3, 4, 5],
    'max_depth': [2, 4, 6, 8, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1, 2, 3, 4, 5, 6, 7, 8],
}

number_models = 4  # number of random parameter combinations to sample
random_RandomForest_class = RandomizedSearchCV(
    estimator=RandomForestClassifier(),  # originally the 'clf' step of a pipeline
    param_distributions=parameter_grid,
    n_iter=number_models,
    scoring='accuracy',
    n_jobs=2,
    cv=4,
    refit=True,
    return_train_score=True)

random_RandomForest_class.fit(X_train, y_train)
predictions = random_RandomForest_class.predict(X)

print("Accuracy Score", accuracy_score(y, predictions))
print("Best params", random_RandomForest_class.best_params_)
print("Best score", random_RandomForest_class.best_score_)
0 votes

It is natural that a random forest stabilizes after a certain n_estimators, because (unlike boosting) there is no mechanism to "slow down" the fitting. Since there is no benefit to adding more tree estimators beyond that point, you can choose a value around 50.
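
A minimal sketch of one way to see that stabilization point without refitting a forest from scratch at each size, using warm_start to grow the same forest incrementally and the out-of-bag error as a validation-free estimate (x_train and y_train from the question are assumed):

from sklearn.ensemble import RandomForestClassifier

sizes = range(10, 201, 10)
oob_errors = []
# warm_start=True reuses the trees already built when n_estimators grows,
# so each step only fits the newly added trees
rfc = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
for n in sizes:
    rfc.set_params(n_estimators=n)
    rfc.fit(x_train, y_train)
    oob_errors.append(1 - rfc.oob_score_)

# pick the smallest size after which the OOB error stops improving noticeably
for n, err in zip(sizes, oob_errors):
    print(n, round(err, 4))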

0 votes

Don't use grid search for this case; it is overkill. Also, since you set the candidate parameter values arbitrarily, you may not end up with the optimum number.

There is a staged_predict method on scikit-learn's gradient boosting estimators, with which you can measure the validation error at each stage of training to find the optimum number of trees.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y)

# fit once with a deliberately large n_estimators
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100)
gbrt.fit(X_train, y_train)

# validation error after each boosting stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]

# the stage with the lowest validation error gives the optimum tree count
bst_n_estimators = np.argmin(errors) + 1
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
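
If your scikit-learn is 0.20 or newer, the same idea is available built in: early stopping via n_iter_no_change stops adding trees once the score on an internal validation split stops improving. A minimal sketch with illustrative values:

from sklearn.ensemble import GradientBoostingRegressor

# n_estimators=100 is just an upper bound; fitting stops once the score
# on a held-out 10% validation fraction hasn't improved for 10 rounds
gbrt_es = GradientBoostingRegressor(
    max_depth=2, n_estimators=100,
    validation_fraction=0.1, n_iter_no_change=10, tol=1e-4)
gbrt_es.fit(X_train, y_train)
print("Trees actually fitted:", gbrt_es.n_estimators_)

This avoids the manual argmin pass, at the cost of carving the validation split out of the training data automatically.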