I would like to use a pipeline including a TfidfVectorizer and an SVC. In between, however, I would like to concatenate some features extracted from non-textual data to the output of the TfidfVectorizer.

I tried creating a custom class to do this (an approach based on this tutorial), but it does not seem to work.

Here is what I have tried so far:

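import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([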
    ('tfidf', TfidfVectorizer()),
    ('transformer', CustomTransformer(one_hot_feats)),
    ('clf', MultinomialNB()),
])

parameters = {
    'tfidf__min_df': (5, 10, 15, 20, 25, 30),
    'tfidf__max_df': (0.8, 0.9, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': np.linspace(0.1, 1.5, 15),
    'clf__fit_prior': [True, False],
}

grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(df["short description"], labels)

Here is the CustomTransformer class:

from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    """Class that concatenates the one-hot encoded category feature with the tfidf data."""

    def __init__(self, one_hot_features):
        """Initializes an instance of our custom transformer."""
        self.one_hot_features = one_hot_features

    def fit(self, X, y=None, **kwargs):
        """Dummy fit function that does nothing in particular."""
        return self

    def transform(self, X, y=None, **kwargs):
        """Adds our external features (rows must line up with the rows of X)."""
        # X is the sparse TF-IDF matrix, so stack with scipy.sparse, not numpy.
        return sparse.hstack((self.one_hot_features, X)).tocsr()

This approach works as long as X does not change dimensions inside the custom class (probably a limitation related to TransformerMixin). In my case, however, additional features are appended to my data. Should my custom class inherit from a different base class, or is there a different approach to solve this?

1 Answer

You can combine multiple features using scikit-learn's FeatureUnion, and transform specific columns using ColumnTransformer.

From the docs:

FeatureUnion

Concatenates results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
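As a rough illustration of the FeatureUnion description, here is a minimal sketch. The toy DataFrame, its column names (text_column, categorical_column), and the select_text / select_category helpers are assumptions made up for this example, not part of the question. Because FeatureUnion hands the full input to every branch, each branch first selects its own column with a FunctionTransformer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "text_column": ["printer is broken", "cannot log in",
                    "printer out of ink", "password reset"],
    "categorical_column": ["hardware", "account", "hardware", "account"],
})


def select_text(frame):
    # TfidfVectorizer expects a 1-D sequence of strings, so return a Series.
    return frame["text_column"]


def select_category(frame):
    # OneHotEncoder expects 2-D input, so return a one-column DataFrame.
    return frame[["categorical_column"]]


union = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(select_text)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("category", Pipeline([
        ("select", FunctionTransformer(select_category)),
        ("onehot", OneHotEncoder()),
    ])),
])

# Both branches run on the same rows; their outputs are stacked column-wise.
features = union.fit_transform(df)
print(features.shape)
```

Since every branch derives its features from the rows it is given, the stacked blocks stay aligned when cross-validation passes in a subset of the data.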

ColumnTransformer

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

In your case, you can do that using make_column_transformer:

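from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
pipeline = Pipeline([
    # TfidfVectorizer needs 1-D input, so its column is a plain string, not a list.
    ('transformer',  make_column_transformer((TfidfVectorizer(), 'text_column'),
                                             (OneHotEncoder(), ['categorical_column']),)),
    ('clf', MultinomialNB()),
])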
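To tie this back to the grid search in the question, here is a hedged end-to-end sketch. The toy data and labels are invented for illustration, and handle_unknown="ignore" is an extra I added so that a category missing from a training fold does not raise an error. The text column is passed as a plain string because TfidfVectorizer expects 1-D input. Note that make_column_transformer names each step after its lowercased class name, so the vectorizer's parameters are reached through the transformer__tfidfvectorizer__ prefix.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names and labels are assumptions.
df = pd.DataFrame({
    "text_column": [
        "printer is broken", "cannot log in", "printer out of ink",
        "password reset needed", "screen is flickering", "account locked out",
        "keyboard keys stuck", "forgot my password",
    ],
    "categorical_column": [
        "office", "remote", "office", "remote",
        "office", "remote", "remote", "office",
    ],
})
labels = ["hardware", "account", "hardware", "account",
          "hardware", "account", "hardware", "account"]

pipeline = Pipeline([
    ("transformer", make_column_transformer(
        (TfidfVectorizer(), "text_column"),
        (OneHotEncoder(handle_unknown="ignore"), ["categorical_column"]),
    )),
    ("clf", MultinomialNB()),
])

# Nested parameter names: <pipeline step>__<transformer name>__<parameter>.
parameters = {
    "transformer__tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "transformer__tfidfvectorizer__norm": ["l1", "l2"],
    "clf__alpha": [0.1, 1.0],
}

grid_search = GridSearchCV(pipeline, parameters, cv=2)
# Fit on the whole DataFrame, not just the text column.
grid_search.fit(df, labels)
print(grid_search.best_params_)
```

The key difference from the original attempt is that fit receives the full DataFrame, so each cross-validation fold carries its own text and categorical rows together.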

EDIT:

Set remainder to 'passthrough' in make_column_transformer so that all remaining columns not specified in the transformers are passed through automatically.
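A small sketch of the passthrough behaviour, assuming a hypothetical extra numeric column (num_attachments) that no transformer claims. All column names and values here are made up for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "num_attachments" stands in for any extra feature.
df = pd.DataFrame({
    "text_column": ["printer is broken", "cannot log in",
                    "printer out of ink", "password reset"],
    "categorical_column": ["hardware", "account", "hardware", "account"],
    "num_attachments": [0, 2, 1, 0],
})

transformer = make_column_transformer(
    (TfidfVectorizer(), "text_column"),
    (OneHotEncoder(), ["categorical_column"]),
    remainder="passthrough",  # keep columns not listed above, unchanged
)

features = transformer.fit_transform(df)

# Width = TF-IDF vocabulary + one-hot categories + one passed-through column.
n_text = len(transformer.named_transformers_["tfidfvectorizer"].vocabulary_)
n_cat = len(transformer.named_transformers_["onehotencoder"].categories_[0])
print(features.shape, n_text + n_cat + 1)
```

One caveat when combining this with MultinomialNB: that classifier requires non-negative feature values, so any passed-through column has to satisfy that as well.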