I'm using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the "best" parameters for several techniques, yet many of these perform worse than the defaults. I include the default parameter values as options in the grid, so I'm surprised this would happen.
For example:
from time import time
from sklearn import grid_search
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(verbose=1)
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}
clf = grid_search.GridSearchCV(gbc, parameters)

t0 = time()
clf.fit(X_crossval, labels)  # X_crossval / labels are my full dataset
print "Gridsearch time:", round(time() - t0, 3), "s"
print clf.best_params_
# The output is: {'min_samples_split': 2, 'learning_rate': 0.01, 'max_depth': 2}
These are close to the defaults: the only differences are that the default max_depth is 3 and the default learning_rate is 0.1. When I use the grid-searched parameters, I get an accuracy of 72%, compared to 78% with the defaults.
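For what it's worth, I'm also planning to inspect the mean cross-validated score the search assigned to each parameter combination, to check whether the near-default settings really did score best inside the search. A minimal sketch, assuming my version's grid_search.GridSearchCV exposes grid_scores_:

# Mean cross-validated score for every combination the search tried,
# sorted best-first, so I can see where the near-default settings landed.
for entry in sorted(clf.grid_scores_,
                    key=lambda e: e.mean_validation_score,
                    reverse=True):
    print entry.parameters, round(entry.mean_validation_score, 4)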
One thing I did that I will admit is suspicious: I used my entire dataset for the cross-validation. Then, after obtaining the parameters, I ran the model on the same dataset, split 75-25 into training/testing.
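For reference, this is the scheme I suspect I should have used instead: split first, run the grid search only on the training portion, and score the refit best estimator on the held-out 25%. Just a sketch, reusing the parameters grid and data from above and assuming accuracy as the metric:

from sklearn import grid_search
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Hold out 25% of the data *before* any tuning happens
# (X_crossval, labels and parameters are the same objects as above).
X_train, X_test, y_train, y_test = train_test_split(
    X_crossval, labels, test_size=0.25, random_state=0)

gbc = GradientBoostingClassifier()
clf = grid_search.GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)  # the search never sees the held-out 25%

# With the default refit=True, best_estimator_ is refit on all of X_train
# using the best parameter combination.
print "Held-out accuracy:", clf.best_estimator_.score(X_test, y_test)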
Is there a reason my grid search overlooked the "superior" defaults?