I have an estimator like this:

    from sklearn.base import BaseEstimator, TransformerMixin
    import numpy as np

    class customEstimator(BaseEstimator, TransformerMixin):
        def __init__(self, estimator_var):
            self.estimator_var = estimator_var
        def transform(self, X):
            # expensive computation, currently redone on every call
            self.tmpVar = np.random.randn(self.estimator_var, self.estimator_var)
            return np.hstack((self.tmpVar, X)) # this is just an example
        def fit(self, X, y=None):
            return self
        def get_params(self, deep=False):
            # note: self.tmpVar only exists once transform has been called
            return {'estimator_var': self.estimator_var, 'tmpVar': self.tmpVar}

I then put this estimator in a pipeline (together with others) and feed the pipeline into GridSearchCV for k-fold cross-validation. K-fold cross-validation goes something like this:

    for every possible parameter combination:
        for every fold split:
            compute score(mini_train, mini_test)
        compute average score
    pick the best combination
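
For concreteness, here is a minimal sketch of the setup I mean (the second pipeline step and the grid values are hypothetical, just to show the wiring):

    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    # hypothetical pipeline: the custom transformer followed by some classifier
    pipe = Pipeline([
        ('custom', customEstimator(estimator_var=5)),
        ('clf', LogisticRegression()),
    ])

    # every value of estimator_var is one parameter combination; GridSearchCV
    # re-fits the pipeline on each fold for each combination
    param_grid = {'custom__estimator_var': [5, 10, 20]}
    search = GridSearchCV(pipe, param_grid, cv=5)
    # search.fit(X, y) would then run the nested loop sketched above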

The issue is that, for a given combination of parameters, I would like to compute self.tmpVar (which may be slow to compute) only once, and reuse it across all the fold splits that share that combination of parameters.

Would that be possible in scikit-learn, or is there a workaround?


1 Answer


Simply store this variable as a class (static) attribute, or in some other global scope:

    from sklearn.base import BaseEstimator, TransformerMixin
    import numpy as np

    class customEstimator(BaseEstimator, TransformerMixin):

        tmpVar = None  # shared by all instances, including scikit-learn clones

        def __init__(self, estimator_var):
            self.estimator_var = estimator_var
        def transform(self, X):
            # compute the expensive matrix only once, on the first call
            if customEstimator.tmpVar is None:
                customEstimator.tmpVar = np.random.randn(self.estimator_var, self.estimator_var)
            return np.hstack((customEstimator.tmpVar, X)) # this is just an example
        def fit(self, X, y=None):
            return self
        def get_params(self, deep=False):
            return {'estimator_var': self.estimator_var}
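
As a quick sanity check of the caching (the shapes here are chosen only so the hstack works):

    from sklearn.base import clone

    est = customEstimator(estimator_var=3)
    X = np.zeros((3, 2))

    est.transform(X)      # first call computes and caches tmpVar
    cloned = clone(est)   # GridSearchCV clones estimators like this
    cloned.transform(X)   # reuses customEstimator.tmpVar, nothing recomputed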

Of course, the problem here is that if you reuse your estimator many times, with different data, you will sometimes want to reset it. In that case you can simply give each estimator a name and store the tmpVars in a map (dictionary) keyed by those names. You can even have the names generated automatically, along the lines of:

    from sklearn.base import BaseEstimator, TransformerMixin
    import numpy as np

    class customEstimator(BaseEstimator, TransformerMixin):

        tmpVars = {}      # one cached matrix per estimator name
        estimators = 0    # counter used for auto-generated names

        def __init__(self, estimator_var, name=None):
            if name is None:
                customEstimator.estimators = customEstimator.estimators + 1
                name = 'Estimator %d' % customEstimator.estimators
            self.name = name
            self.estimator_var = estimator_var
        def transform(self, X):
            # compute the expensive matrix only once per name
            if self.name not in customEstimator.tmpVars:
                customEstimator.tmpVars[self.name] = np.random.randn(self.estimator_var, self.estimator_var)
            return np.hstack((customEstimator.tmpVars[self.name], X)) # this is just an example
        def fit(self, X, y=None):
            return self
        def get_params(self, deep=False):
            return {'estimator_var': self.estimator_var, 'name': self.name}

This way, if you create a new instance of customEstimator it gets a new name, but when it is cloned by scikit-learn the clones share the same name (and consequently the same data).
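
For illustration, the naming behaviour plays out roughly like this (the shapes are again arbitrary):

    from sklearn.base import clone

    a = customEstimator(estimator_var=3)   # auto-named 'Estimator 1'
    b = customEstimator(estimator_var=3)   # auto-named 'Estimator 2'
    c = clone(a)                           # keeps the name 'Estimator 1'

    X = np.zeros((3, 2))
    a.transform(X)
    c.transform(X)   # same name as a, so it reuses a's cached matrix
    b.transform(X)   # different name, so it computes its own matrix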