16 votes

I've been attempting to use weighted samples in scikit-learn while training a Random Forest classifier. It works well when I pass the sample weights to the classifier directly, e.g. RandomForestClassifier().fit(X, y, sample_weight=weights), but when I tried a grid search to find better hyperparameters for the classifier, I hit a wall:

To pass the weights when using grid search, the usage is:

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+

grid_search = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1,
                           fit_params={"sample_weight": weights})

The problem is that the cross-validator isn't aware of the sample weights, so it doesn't resample them together with the actual data. Calling grid_search.fit(X, y) therefore fails: the cross-validator creates subsets of X and y (sub_X and sub_y), and eventually the classifier is called with classifier.fit(sub_X, sub_y, sample_weight=weights), but weights hasn't been resampled, so an exception is thrown.
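To make the failure concrete, here is a minimal sketch of what the cross-validator effectively does internally. The toy data and the 3-fold split are just for illustration, and the import path assumes sklearn >= 0.18:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
weights = np.random.rand(100)

for train_idx, test_idx in KFold(n_splits=3).split(X):
    sub_X, sub_y = X[train_idx], y[train_idx]
    # fit_params are passed through untouched, so weights still has
    # length 100 while sub_X/sub_y have ~67 rows: shape mismatch
    RandomForestClassifier().fit(sub_X, sub_y, sample_weight=weights)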

For now I've worked around the issue by over-sampling high-weight samples before training the classifier, but it's only a temporary workaround. Any suggestions on how to proceed?
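For reference, the workaround looks roughly like this. This is a minimal sketch, assuming strictly positive weights; note that the rounding loses precision, and if the smallest weight is tiny the duplication can blow up the training set:

import numpy as np

# duplicate each row proportionally to its weight (rounded to an integer)
counts = np.maximum(1, np.round(weights / weights.min()).astype(int))
X_over = np.repeat(X, counts, axis=0)
y_over = np.repeat(y, counts)
grid_search.fit(X_over, y_over)  # no sample_weight needed any more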


3 Answers

7 votes

Edit: the scores I see from the code below don't seem quite right. This is possibly because, as mentioned above, even when the weights are used in fitting they might not be used in scoring.

It appears that this has been fixed now. I am running sklearn version 0.15.2. My code looks something like this:

from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+

model = SGDRegressor()
parameters = {'alpha': [0.01, 0.001, 0.0001]}
cv = GridSearchCV(model, parameters, fit_params={'sample_weight': weights})
cv.fit(X, y)

Hope that helps (you and others who see this post).

7 votes

I have too little reputation to comment on @xenocyon's answer. I'm using sklearn 0.18.1 and I also use a Pipeline in my code. The solution that worked for me was:

fit_params={'classifier__sample_weight': w}, where w is the weight vector and classifier is the step name in the pipeline.
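In context, that looks something like the following sketch. The StandardScaler step and the n_estimators grid are placeholders; also note that from sklearn 0.19 on, fit parameters are passed to fit() instead, as in cv.fit(X, y, classifier__sample_weight=w):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', RandomForestClassifier())])
params = {'classifier__n_estimators': [50, 100]}

# the 'classifier__' prefix routes the weights to the right pipeline step
cv = GridSearchCV(pipe, params, fit_params={'classifier__sample_weight': w})
cv.fit(X, y)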

2 votes

I would suggest writing your own cross-validated parameter selection, since it is only about 10-15 lines of code in Python (especially using the KFold object from scikit-learn), whereas oversampling can be a serious bottleneck.
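Something along these lines. This is a minimal sketch, assuming a RandomForestClassifier, a small hand-rolled parameter grid, and accuracy scoring via model.score (import path for sklearn >= 0.18):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

param_grid = [{'n_estimators': n} for n in (50, 100, 200)]
best_score, best_params = -np.inf, None

for params in param_grid:
    scores = []
    for train_idx, test_idx in KFold(n_splits=5).split(X):
        model = RandomForestClassifier(**params)
        # slice the weights together with the data -- exactly the
        # step the stock cross-validator skips
        model.fit(X[train_idx], y[train_idx],
                  sample_weight=weights[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    if np.mean(scores) > best_score:
        best_score, best_params = np.mean(scores), params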