I found out that StandardScaler() can make my RFECV inside my GridSearchCV run faster (each uses 3-fold cross-validation, so the two are nested). Without StandardScaler(), my code ran for more than 2 days, so I canceled it and decided to add StandardScaler to the process. But now it has been running for more than 4 hours and I am not sure I have done it right. Here is my code:

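from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Choose Linear SVM as classifier
LSVM = SVC(kernel='linear')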

selector = RFECV(LSVM, step=1, cv=3, scoring='f1')

param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]

clf = make_pipeline(StandardScaler(), 
                GridSearchCV(selector,
                             param_grid,
                             cv=3,
                             refit=True,
                             scoring='f1'))

clf.fit(X, Y)

To be honest, I don't think I have gotten it right: I think StandardScaler() should go inside GridSearchCV() so that the data is normalized within each fold, not just once (?). Please correct me if I am wrong, or if my pipeline is incorrect and that is why it is still running for so long.

I have 8,000 rows with 145 features to be pruned by RFECV, and 6 values of C to be searched by GridSearchCV. So for each value of C, the best feature set is determined by RFECV.
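To get a feel for why this nesting is slow, here is a rough back-of-the-envelope count of SVM fits. It is only a sketch: it assumes RFECV with step=1 does roughly one fit per feature count from 145 down to 1 in each inner fold, plus a final elimination pass on the full training data, and that GridSearchCV refits once at the end (refit=True). The exact numbers depend on the scikit-learn version.

```python
# Rough estimate of how many linear-SVM fits the nested search performs.
n_features = 145   # features to prune with RFECV
n_c_values = 6     # candidate values of C
inner_folds = 3    # cv=3 inside RFECV
outer_folds = 3    # cv=3 inside GridSearchCV

# Assumption: with step=1, each inner fold fits about once per feature count.
inner_cv_fits = inner_folds * n_features

# Assumption: the final elimination pass on the full training data needs
# at most another n_features fits (fewer if many features are kept).
rfecv_fits_min = inner_cv_fits
rfecv_fits_max = inner_cv_fits + n_features

# GridSearchCV runs RFECV once per (C, outer fold), plus one refit.
rfecv_runs = n_c_values * outer_folds + 1

total_min = rfecv_runs * rfecv_fits_min
total_max = rfecv_runs * rfecv_fits_max
print(f"RFECV runs: {rfecv_runs}")
print(f"SVM fits: roughly {total_min} to {total_max}")
```

So even a fit that takes only a few seconds adds up to many hours, and a fit that converges slowly for some values of C makes it much worse.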

Thanks!

Update:

So I put the StandardScaler inside the RFECV like this:

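 from sklearn.feature_selection import RFECV
 from sklearn.model_selection import GridSearchCV, KFold
 from sklearn.pipeline import Pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.svm import SVC

 clf = SVC(kernel='linear')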

 kf = KFold(n_splits=3, shuffle=True, random_state=0)  

 estimators = [('standardize' , StandardScaler()),
               ('clf', clf)]

 class Mypipeline(Pipeline):
     @property
     def coef_(self):
         return self._final_estimator.coef_
     @property
     def feature_importances_(self):
         return self._final_estimator.feature_importances_ 

 pipeline = Mypipeline(estimators)
 rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)

 param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]

 clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)

But it still throws the following error:

ValueError: Invalid parameter C for estimator Pipeline(memory=None, steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))]). Check the list of available parameters with estimator.get_params().keys().

Yes, you are correct. The make_pipeline containing StandardScaler and SVC should be inside the RFECV, and GridSearchCV should be the outer layer. But even then, you cannot say for sure that the code will finish in a reasonable time. It may simply be that the linear SVM is not able to converge on the given data and so runs for a long time. Combine that with RFE and GridSearch, and the running time increases further. - Vivek Kumar
Okay, but now it throws an error (see the edit). - chmscrbbrfck
Since you have now changed the structure, you also need to change the parameter names. The correct name would be estimator__svc__C. But then you will face errors in RFECV, because it needs the coef_ of the SVC, which is not exposed by the pipeline. See this question. - Vivek Kumar
Also see this for more explanation. - Vivek Kumar
Wow, that is complex. So it turns out that StandardScaler() wouldn't speed up the process. Is it valid to not use it at all (i.e. not normalize the data)? I used the same code (without StandardScaler()) for Logistic Regression; it took only 15 minutes and gave good accuracy. Now I just want to train an SVM for comparison. Is it safe to assume that even though the problem is linear and LR resolved it easily, it can still be hard for a linear SVM? - chmscrbbrfck
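For reference, the structure described in the comments (scaler and SVC in a pipeline, wrapped by RFECV, wrapped by GridSearchCV) can be sketched as below. This is a minimal sketch, not the asker's exact setup: it assumes scikit-learn 0.24 or later, where RFECV accepts an importance_getter argument, so no Pipeline subclass is needed to expose coef_. The synthetic data (X_demo, y_demo) and the shortened C grid are stand-ins to keep the run short. Note that the grid key has to follow the step name: with a step called 'clf' it is estimator__clf__C, while make_pipeline would name the step 'svc'.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in for the real data (assumption, not the asker's X, Y).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=4, random_state=0
)

# The scaler lives inside the pipeline, so it is re-fit on every training fold.
pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("clf", SVC(kernel="linear")),
])

# importance_getter tells RFECV where to find the coefficients in the pipeline.
rfecv = RFECV(
    estimator=pipe,
    step=1,
    cv=3,
    scoring="f1",
    importance_getter="named_steps.clf.coef_",
)

# RFECV.estimator -> pipeline step "clf" -> parameter C
param_grid = {"estimator__clf__C": [0.1, 1]}

search = GridSearchCV(rfecv, param_grid, cv=3, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_estimator_.support_)  # boolean mask of the selected features
```

On older versions without importance_getter, a subclass that exposes coef_ (like the Mypipeline class in the question) serves the same purpose.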

1 Answer

Kumar is right. You might also want to turn on verbose in GridSearchCV. You could also limit the number of iterations of the SVC (its max_iter parameter), starting from a very small number like 5, just to make sure that the problem is not with the convergence.
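A quick way to try the iteration cap outside of the full search might look like the following. It is only a sketch under assumptions: the data is synthetic (X_demo, y_demo stand in for the real X, Y), and timed_fit is a hypothetical helper written for this example. It fits a linear SVC with max_iter=5 for each value of C and records whether scikit-learn raised a ConvergenceWarning, which signals that the solver stopped at the limit rather than converging.

```python
import time
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with the real X, Y.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)


def timed_fit(C, max_iter):
    """Fit a linear SVC and report (seconds, whether the iteration cap was hit)."""
    model = SVC(kernel="linear", C=C, max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        start = time.perf_counter()
        model.fit(X_scaled, y_demo)
        elapsed = time.perf_counter() - start
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return elapsed, hit_limit


results = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    results[C] = timed_fit(C, max_iter=5)
    elapsed, hit_limit = results[C]
    print(f"C={C}: {elapsed:.4f}s, stopped at max_iter: {hit_limit}")
```

If the capped fits are fast but the uncapped ones (max_iter=-1, the default) are slow for some values of C, convergence is the likely bottleneck. Raising max_iter step by step then shows which values of C are the expensive ones.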