I would like to use a pipeline combining a TfidfVectorizer and an SVC. In between, however, I would like to concatenate some features extracted from non-textual data to the output of the TfidfVectorizer.
I have tried writing a custom transformer class for this (an approach based on this tutorial), but it does not seem to work.
Here is what I have tried so far:
pipeline = Pipeline([
('tfidf', TfidfVectorizer()),
('transformer', CustomTransformer(one_hot_feats)),
('clf', MultinomialNB()),
])
parameters = {
'tfidf__min_df': (5, 10, 15, 20, 25, 30),
'tfidf__max_df': (0.8, 0.9, 1.0),
'tfidf__ngram_range': ((1, 1), (1, 2)),
'tfidf__norm': ('l1', 'l2'),
'clf__alpha': np.linspace(0.1, 1.5, 15),
'clf__fit_prior': [True, False],
}
grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(df["short description"], labels)
Here is the CustomTransformer class:
class CustomTransformer(TransformerMixin):
    """Concatenates the one-hot encoded category features with the tf-idf data."""

    def __init__(self, one_hot_features):
        """Initializes an instance of our custom transformer."""
        self.one_hot_features = one_hot_features

    def fit(self, X, y=None, **kwargs):
        """Dummy fit that does nothing; all the work happens in transform."""
        return self

    def transform(self, X, y=None, **kwargs):
        """Appends our external features to the tf-idf output."""
        return np.hstack((self.one_hot_features, X))
This approach works as long as X does not change dimensions inside the custom class (probably a limitation related to the TransformerMixin); in my case, however, additional features are appended, so the shape of the data changes. Should my custom class inherit from a different base class, or is there a different approach to solve this?
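For reference, here is a minimal sketch of one way such a transformer could work, with two changes relative to the class above: it also inherits from BaseEstimator (so GridSearchCV can clone it via get_params/set_params), and it uses scipy.sparse.hstack rather than np.hstack, since TfidfVectorizer emits a sparse matrix on which np.hstack misbehaves. The class name, the extra_features parameter, and the toy data are made up for illustration; this is a sketch assuming the extra features are a dense 2-D array whose rows align with the documents, not a confirmed fix for the problem above.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class SparseFeatureConcatenator(BaseEstimator, TransformerMixin):
    """Appends precomputed row-aligned features to a (sparse) feature matrix."""

    def __init__(self, extra_features):
        # extra_features: dense 2-D array, one row per input document
        self.extra_features = extra_features

    def fit(self, X, y=None):
        # Nothing to learn; the extra features are precomputed.
        return self

    def transform(self, X):
        # scipy.sparse.hstack keeps the tf-idf matrix sparse;
        # np.hstack on a sparse matrix does not concatenate columns.
        return sparse.hstack(
            [X, sparse.csr_matrix(self.extra_features)]
        ).tocsr()

# Toy usage: two documents plus one one-hot category feature pair each.
docs = ["red apple", "green pear"]
one_hot = np.array([[1, 0], [0, 1]])
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("concat", SparseFeatureConcatenator(one_hot)),
    ("clf", MultinomialNB()),
])
pipe.fit(docs, [0, 1])
print(pipe.predict(docs))
```

Because the extra features are fixed at construction time, this only works when fit and predict see the same rows; for differing train/test sets, the features would have to be passed per call or merged upstream instead.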