Combining heterogeneous features in scikit-learn

Question

I'm doing binary classification over documents whose features have already been extracted and are provided in a text file. My problem is that there are both textual features and numerical features, such as years. Each sample has this format:

label |title text |otherText text |numFeature1 number |numFeature2 number

I'm following the documentation about feature unions, but their use case is a bit different: I don't need to extract features from another feature, because the numerical features are already given.

Currently I'm using the following setup:

pipeline = Pipeline([
    ('features', Features()),
    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('otherText', Pipeline([
                ('selector', ItemSelector(key='otherText')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('numFeature1', Pipeline([
                ('selector', ItemSelector(key='numFeature1')),
            ])),
            ('numFeature2', Pipeline([
                ('selector', ItemSelector(key='numFeature2')),
            ])),
        ],
    )),
    ('classifier', MultinomialNB()),
])

The Features class is also adapted from the documentation:

import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Features(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        # One record per sample; every field is stored as a Python object
        features = np.recarray(shape=(len(posts),),
                               dtype=[('title', object), ('otherText', object),
                                      ('numFeature1', object), ('numFeature2', object)])

        for i, text in enumerate(posts):
            # Split on the "|name" markers; parts[0] is the label
            parts = re.split(r"\|\w+", text)
            features['title'][i] = parts[1]
            features['otherText'][i] = parts[2]
            features['numFeature1'][i] = parts[3]
            features['numFeature2'][i] = parts[4]

        return features
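
The parsing step can be tried on its own, outside the transformer. The following is a minimal sketch, not the code from the question: `parse_line` is a hypothetical helper, and the sample values are invented to match the line format shown earlier. It strips surrounding whitespace and shows that the numeric fields arrive as strings, so they still need an explicit conversion before they can be used as numbers.

```python
import re


def parse_line(line):
    """Split one 'label |name value |name value ...' line into (label, fields)."""
    # Split on '|' and drop the whitespace that precedes it
    label, *chunks = re.split(r"\s*\|", line.strip())
    fields = {}
    for chunk in chunks:
        # The first token is the field name, the remainder is its value
        name, _, value = chunk.partition(" ")
        fields[name] = value.strip()
    return label, fields


# Invented sample line in the format described in the question
sample = "1 |title some title text |otherText more text |numFeature1 1999 |numFeature2 3.5"
label, fields = parse_line(sample)

# Numeric fields are still strings at this point and must be converted
year = float(fields["numFeature1"])
score = float(fields["numFeature2"])
print(label, fields["title"], year, score)
```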

My problem now is: how do I add the numerical features to the FeatureUnion? When I use a CountVectorizer, I get "ValueError: empty vocabulary; perhaps the documents only contain stop words", and using a DictVectorizer with only one entry doesn't strike me as the way to go.

Just use the ItemSelector() class with key='numFeature1' and key='numFeature2'. - Vivek Kumar
This returns "ValueError: blocks[0,:] has incompatible row dimensions". - lup3x
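
The row-dimension error reported in the comment above is consistent with the selector returning a 1-D array, which is likely stacked as a single row instead of one row per sample. One possible way around it is to reshape each numeric field into an (n_samples, 1) float column before the union. The sketch below rests on several assumptions: `ToNumericColumn` is a hypothetical helper and not part of scikit-learn, `ItemSelector` is written along the lines of the documentation example, and the toy data is invented and stands in for the output of `Features()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one field from a dict or record array by key."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


class ToNumericColumn(BaseEstimator, TransformerMixin):
    """Hypothetical helper: 1-D numbers or numeric strings -> (n_samples, 1) floats."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([float(v) for v in X], dtype=float).reshape(-1, 1)


# Invented toy data standing in for the output of Features()
data = {
    "title": np.array(["first title", "second title", "third one"], dtype=object),
    "numFeature1": np.array(["1999", "2005", "2012"], dtype=object),
}

union = FeatureUnion(
    transformer_list=[
        ("title", Pipeline([
            ("selector", ItemSelector(key="title")),
            ("tfidf", TfidfVectorizer()),
        ])),
        ("numFeature1", Pipeline([
            ("selector", ItemSelector(key="numFeature1")),
            ("to_column", ToNumericColumn()),
        ])),
    ]
)

# Sparse TF-IDF columns followed by one dense numeric column
X = union.fit_transform(data)
print(X.shape)
```

Raw values such as years are on a very different scale from TF-IDF weights, so some form of scaling may be worth considering before the classifier; MultinomialNB also expects non-negative features.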

Vivek Kalyanarangan · Accepted Answer · 2017-02-03T11:37:05

The TfidfVectorizer() object has not been fitted with data yet.

Before constructing the pipeline, do this:

vec = TfidfVectorizer()
vec.fit(data['free text column'])
pipeline = Pipeline([
('features', Features()),

('union', FeatureUnion(
    transformer_list=[
        ('title', Pipeline([
            ('selector', ItemSelector(key='title')),
            ('tfidf', vec),
        ])),

        ... other features

This helps if you want to run your data through the pipeline again for testing. For test data, the pipeline automatically calls transform() on the TfidfVectorizer instead of fit(), so the fit() call is something you have to do explicitly before constructing the pipeline.
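
The difference between fit() and transform() can be illustrated in isolation. This is a minimal sketch with invented toy strings, not the data from the question: a vectorizer fitted on training text maps unseen text onto the same vocabulary, so the feature columns line up between training and test data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpora
train_docs = ["red apple", "green apple"]
test_docs = ["blue apple pie"]

vec = TfidfVectorizer()
vec.fit(train_docs)  # learns the vocabulary and IDF weights from training text only

# transform() reuses the fitted vocabulary; words not seen during fit are ignored
X_test = vec.transform(test_docs)

print(sorted(vec.vocabulary_))
print(X_test.shape)
```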
