16 votes

I've been attempting to use weighted samples in scikit-learn while training a Random Forest classifier. It works well when I pass the sample weights to the classifier directly, e.g. RandomForestClassifier().fit(X, y, sample_weight=weights), but when I tried a grid search to find better hyperparameters for the classifier, I hit a wall:

To pass the weights when using grid search, the usage is:

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+

grid_search = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1,
                           fit_params={"sample_weight": weights})

The problem is that the cross-validator isn't aware of the sample weights, so it doesn't resample them together with the actual data. Calling grid_search.fit(X, y) therefore fails: the cross-validator creates subsets of X and y (sub_X and sub_y), and eventually the classifier is called with classifier.fit(sub_X, sub_y, sample_weight=weights), but weights hasn't been resampled, so an exception is thrown.
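To make the failure concrete, here is a minimal sketch of what the cross-validator effectively does internally. The toy data and the 3-fold split are just for illustration, and the import path assumes sklearn >= 0.18:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
weights = np.random.rand(100)

for train_idx, test_idx in KFold(n_splits=3).split(X):
    sub_X, sub_y = X[train_idx], y[train_idx]
    # fit_params are passed through untouched, so weights still has
    # length 100 while sub_X/sub_y have ~67 rows: shape mismatch
    RandomForestClassifier().fit(sub_X, sub_y, sample_weight=weights)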

For now I've worked around the issue by over-sampling high-weight samples before training the classifier, but it's only a temporary workaround. Any suggestions on how to proceed?
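For reference, the workaround looks roughly like this. This is a minimal sketch, assuming strictly positive weights; note that the rounding loses precision, and if the smallest weight is tiny the duplication can blow up the training set:

import numpy as np

# duplicate each row proportionally to its weight (rounded to an integer)
counts = np.maximum(1, np.round(weights / weights.min()).astype(int))
X_over = np.repeat(X, counts, axis=0)
y_over = np.repeat(y, counts)
grid_search.fit(X_over, y_over)  # no sample_weight needed any more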


3 Answers

7 votes

Edit: the scores I see from the code below don't seem quite right. This is possibly because, as mentioned above, even when the weights are used in fitting they might not be used in scoring.

It appears that this has been fixed now. I am running sklearn version 0.15.2. My code looks something like this:

from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+

model = SGDRegressor()
parameters = {'alpha': [0.01, 0.001, 0.0001]}
cv = GridSearchCV(model, parameters, fit_params={'sample_weight': weights})
cv.fit(X, y)

Hope that helps (you and others who see this post).

7 votes

I have too little reputation to comment on @xenocyon's answer. I'm using sklearn 0.18.1 and I also use a Pipeline in my code. The solution that worked for me was:

fit_params={'classifier__sample_weight': w}, where w is the weight vector and classifier is the step name in the pipeline.
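In context, that looks something like the following sketch. The StandardScaler step and the n_estimators grid are placeholders; also note that from sklearn 0.19 on, fit parameters are passed to fit() instead, as in cv.fit(X, y, classifier__sample_weight=w):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', RandomForestClassifier())])
params = {'classifier__n_estimators': [50, 100]}

# the 'classifier__' prefix routes the weights to the right pipeline step
cv = GridSearchCV(pipe, params, fit_params={'classifier__sample_weight': w})
cv.fit(X, y)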

2 votes

I would suggest writing your own cross-validated parameter selection, since it is only about 10-15 lines of code in Python (especially using the KFold object from scikit-learn), whereas oversampling can be a serious bottleneck.
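Something along these lines. This is a minimal sketch, assuming a RandomForestClassifier, a small hand-rolled parameter grid, and accuracy scoring via model.score (import path for sklearn >= 0.18):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

param_grid = [{'n_estimators': n} for n in (50, 100, 200)]
best_score, best_params = -np.inf, None

for params in param_grid:
    scores = []
    for train_idx, test_idx in KFold(n_splits=5).split(X):
        model = RandomForestClassifier(**params)
        # slice the weights together with the data -- exactly the
        # step the stock cross-validator skips
        model.fit(X[train_idx], y[train_idx],
                  sample_weight=weights[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    if np.mean(scores) > best_score:
        best_score, best_params = np.mean(scores), params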