7
votes

Is there a convenient mechanism for locking steps in a scikit-learn pipeline to prevent them from refitting on pipeline.fit()? For example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train')
firsttwoclasses = data.target<=1
y = data.target[firsttwoclasses]
X = np.array(data.data)[firsttwoclasses]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC())
])

# fit initial step on subset of data, perhaps an entirely different subset
# this particular example would not be very useful in practice
pipeline.named_steps["vectorizer"].fit(X[:400])
X2 = pipeline.named_steps["vectorizer"].transform(X)

# fit estimator on all data without refitting vectorizer
pipeline.named_steps["estimator"].fit(X2, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))

# fitting entire pipeline refits vectorizer
# is there a convenient way to lock the vectorizer without doing the above?
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))

The only way I could think of doing this without intermediate transformations would be to define a custom estimator class (as seen here) whose fit method does nothing and whose transform method is the transform of the pre-fit transformer. Is this the only way?
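For reference, the wrapper idea described above might look roughly like the sketch below. The class name `PrefitTransformer` and the tiny in-memory corpus are hypothetical, chosen only to keep the example self-contained; it is a minimal illustration, not an existing scikit-learn class.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class PrefitTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper around an already-fitted transformer."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None, **fit_params):
        # Deliberately a no-op, so Pipeline.fit() leaves the wrapped step alone.
        return self

    def transform(self, X):
        # Delegate to the pre-fitted transformer.
        return self.transformer.transform(X)


# Toy data (an assumption for illustration, not the 20 newsgroups set).
docs = ["red apple", "green apple", "fast car", "slow car", "blue boat", "old boat"]
labels = [0, 0, 1, 1, 1, 1]

# Fit the vectorizer on a subset only.
vectorizer = CountVectorizer().fit(docs[:4])
vocab_before = dict(vectorizer.vocabulary_)

locked_pipeline = Pipeline([
    ("vectorizer", PrefitTransformer(vectorizer)),
    ("estimator", LinearSVC()),
])
locked_pipeline.fit(docs, labels)

# The wrapped vectorizer keeps the vocabulary learned from the subset.
vocab_after = locked_pipeline.named_steps["vectorizer"].transformer.vocabulary_
```

One caveat with this sketch: `sklearn.base.clone` builds an unfitted copy of the wrapped transformer, so it is only suitable for direct `pipeline.fit()` calls, not for utilities that clone the pipeline (for example `GridSearchCV`) or for a pipeline with `memory` set.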

2 Answers

3
votes

Looking through the source, Pipeline does not appear to offer anything like this: calling .fit() on the pipeline fits every stage (intermediate steps via .fit_transform(), the final step via .fit()).

The best quick-and-dirty solution I could come up with is to monkey-patch away the stage's fitting functionality:

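vectorizer = pipeline.named_steps["vectorizer"]
vectorizer.fit(X[:400])
# disable .fit() on the vectorizer step; attributes set on an instance are not
# bound methods, so the replacement functions must not take `self`
vectorizer.fit = lambda X, y=None, **fit_params: vectorizer
# the pipeline calls fit_transform(X, y) on intermediate steps, so forward only X to transform()
vectorizer.fit_transform = lambda X, y=None, **fit_params: vectorizer.transform(X)

# with the default memory=None the pipeline does not clone its steps, so the patched instance is used
pipeline.fit(X, y)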
0
votes

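You can take a subset of your pipeline, for example:

# exclude the last step; best_estimator_ assumes `pipeline` is a fitted search
# object such as GridSearchCV (for a plain Pipeline, use pipeline.steps[:-1])
preprocess_pipeline = Pipeline(pipeline.best_estimator_.steps[:-1])

and then

tmp = preprocess_pipeline.fit(x_train)
normalized_x = tmp.transform(x_train)  # transform(), not fit_transform(), so nothing is refit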
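Applied to the plain Pipeline from the question, the same idea could be sketched as follows. The toy corpus and the variable names are assumptions made for illustration; the point is only that the preprocessing steps are fitted once on a subset and the final estimator is then fitted on their output, without calling `pipeline.fit()`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data (an assumption for illustration).
docs = ["red apple", "green apple", "fast car", "slow car", "blue boat", "old boat"]
labels = [0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC()),
])

# Sub-pipeline sharing the same step objects, minus the final estimator.
preprocess = Pipeline(pipeline.steps[:-1])
preprocess.fit(docs[:4])  # fit the preprocessing on a subset only

# Transform all data with the already-fitted steps, then fit just the estimator.
features = preprocess.transform(docs)
pipeline.steps[-1][1].fit(features, labels)

# The full pipeline can now predict without the vectorizer having been refit.
predictions = pipeline.predict(docs)
vocab_size = len(pipeline.named_steps["vectorizer"].vocabulary_)
```

Because `Pipeline(pipeline.steps[:-1])` reuses the step objects rather than copying them, the vectorizer fitted through `preprocess` is the same one the full pipeline uses at prediction time.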