
I am working with the scikit-learn random forest classifier, and I want to reduce the false-positive rate by raising the fraction of trees needed for a successful vote from greater than 50% to, say, 75%. After reading the documentation I am not sure how to do this. Does anyone have any suggestions? (I think there should be a way to do this, because according to the documentation the classifier's predict method decides based on a majority vote.) All help appreciated, thanks!

predict_proba() >= 0.75? - Thomas Jungblut
@hiqbal don't forget to accept my answer in case it suits you... :) - omerbp

1 Answer


Let's say you now have a classifier that requires 75% agreement among all the estimators. If it gets a new sample and the odds are 51%-49% in favour of one class, what do you want it to do?

The reason the 50% rule is used is that a decision rule like the one you propose leads to cases where the classifier says "I cannot predict the label of this sample".
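To make that trade-off concrete, here is a minimal sketch (the probability values are made up for illustration): with a 0.75 threshold, a 60/40 split between the two classes produces no prediction at all, while an 80/20 split does.

```python
import numpy as np

# Hypothetical predict_proba output for two samples, two classes.
proba = np.array([[0.60, 0.40],   # ambiguous: no class reaches 0.75
                  [0.80, 0.20]])  # confident: class 0 reaches 0.75

threshold = 0.75
decided = proba.max(axis=1) >= threshold  # which samples get a label at all
print(decided)  # [False  True]
```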

What you can do is wrap the results of the classifier and do whatever calculations you wish:

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np

def my_decision_function(arr):
    # arr is the (n_samples, 2) output of predict_proba.
    # If the two class probabilities differ by at least 0.5, the larger
    # one must be at least 0.75 (they sum to 1); otherwise mark the
    # sample as undecided with [-1, -1].
    diff = np.abs(arr[:, 0] - arr[:, 1])
    arr[diff < 0.5] = [-1, -1]
    return arr


X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=2)
train_samples = 100  # samples used for training the model

X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]

clf = RandomForestClassifier().fit(X_train, y_train)
print(my_decision_function(clf.predict_proba(X_train)))

Now, each sample for which neither class reaches a probability of 0.75 will have the [-1, -1] prediction. Some adjustments must be made if you do multi-class classification, but I hope the idea is clear.
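If you would rather get class labels back than a modified probability array, the same threshold can be applied directly to the predict_proba output. A sketch, assuming two classes; the 0.75 threshold, the helper name predict_with_threshold, and the -1 "reject" label are all arbitrary choices for illustration, not part of scikit-learn's API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def predict_with_threshold(clf, X, threshold=0.75, reject=-1):
    """Return the predicted class where the forest's averaged
    probability reaches `threshold`, otherwise `reject`."""
    proba = clf.predict_proba(X)               # shape (n_samples, n_classes)
    best = np.argmax(proba, axis=1)            # index of the most probable class
    confident = proba[np.arange(len(X)), best] >= threshold
    return np.where(confident, clf.classes_[best], reject)

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)
preds = predict_with_threshold(clf, X)
```

Because the forest averages the per-tree probabilities, this is close to (though not exactly the same as) requiring 75% of the trees to agree.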