I'm doing binary classification over documents whose features have already been extracted and stored in a text file. My problem is that there are both textual features and numerical features, such as years and some others. Each sample is given in this format:
label |title text |otherText text |numFeature1 number |numFeature2 number
I'm following the documentation on feature unions, but its use case is a bit different: I don't extract the features from another feature, because the numerical features are already given.
My current setup looks like this:
pipeline = Pipeline([
    ('features', Features()),
    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('otherText', Pipeline([
                ('selector', ItemSelector(key='otherText')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('numFeature1', Pipeline([
                ('selector', ItemSelector(key='numFeature1')),
            ])),
            ('numFeature2', Pipeline([
                ('selector', ItemSelector(key='numFeature2')),
            ])),
        ],
    )),
    ('classifier', MultinomialNB()),
])
The Features class is also adapted from the documentation:
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Features(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        # One record per sample, one field per feature.
        features = np.recarray(shape=(len(posts),),
                               dtype=[('title', object), ('otherText', object),
                                      ('numFeature1', object), ('numFeature2', object)])
        for i, text in enumerate(posts):
            # Split on the "|name" markers; l[0] is the label.
            l = re.split(r"\|\w+", text)
            features['title'][i] = l[1]
            features['otherText'][i] = l[2]
            # The numerical fields are stored as raw strings here.
            features['numFeature1'][i] = l[3]
            features['numFeature2'][i] = l[4]
        return features
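To make the parsing step concrete, here is a small sketch of what the split produces for a single line. The sample line is made up for illustration, and the conversion to float is an assumption about how the numerical fields could be handled, not part of the class above:

```python
import re

# Hypothetical sample line in the format described above.
line = "1 |title some title |otherText more text |numFeature1 1999 |numFeature2 3"

# Splitting on the "|name" markers leaves the label first, then one chunk per feature.
parts = re.split(r"\|\w+", line)

label = parts[0].strip()
title = parts[1].strip()
other_text = parts[2].strip()

# The numerical fields come back as strings padded with spaces;
# float() tolerates the surrounding whitespace.
num1 = float(parts[3])
num2 = float(parts[4])

print(parts)
print(label, title, other_text, num1, num2)
```

Note that every chunk keeps its surrounding spaces, and the numbers are still text until they are converted explicitly.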
My problem now is: how do I add the numerical features to the FeatureUnion? When I use a CountVectorizer on them, I get "ValueError: empty vocabulary; perhaps the documents only contain stop words", and using a DictVectorizer with only one entry doesn't strike me as the way to go.
ValueError: blocks[0,:] has incompatible row dimensions
– lup3x
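One possible direction, sketched under assumptions: the "incompatible row dimensions" error most likely comes from the numerical selectors returning 1-D sequences, which cannot be stacked next to the 2-D TF-IDF matrices. Reshaping each numerical feature into a float array of shape (n_samples, 1) should let FeatureUnion stack it. In the sketch below, `ToNumericColumn` is a hypothetical helper, `ItemSelector` is a minimal stand-in for the one in the documentation, and the data is invented and passed as a plain dict instead of going through `Features`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Minimal stand-in: pick one named column from the input."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


class ToNumericColumn(BaseEstimator, TransformerMixin):
    """Hypothetical helper: convert a 1-D sequence to a (n_samples, 1) float array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([float(v) for v in X]).reshape(-1, 1)


# Invented toy data; the numerical features are strings, as after parsing.
data = {
    'title': ["cheap pills online", "meeting agenda monday",
              "win money now", "project status report"],
    'otherText': ["buy now limited offer", "please review the notes",
                  "claim your prize today", "weekly summary attached"],
    'numFeature1': ["1999", "2005", "2001", "2010"],
    'numFeature2': ["3", "0", "7", "1"],
}
labels = [1, 0, 1, 0]

union = FeatureUnion(transformer_list=[
    ('title', Pipeline([
        ('selector', ItemSelector(key='title')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('otherText', Pipeline([
        ('selector', ItemSelector(key='otherText')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('numFeature1', Pipeline([
        ('selector', ItemSelector(key='numFeature1')),
        ('to_column', ToNumericColumn()),
    ])),
    ('numFeature2', Pipeline([
        ('selector', ItemSelector(key='numFeature2')),
        ('to_column', ToNumericColumn()),
    ])),
])

# Inspect the stacked feature matrix: text columns first, then the two numbers.
features = union.fit_transform(data)
print(features.shape)

pipeline = Pipeline([
    ('union', union),
    ('classifier', MultinomialNB()),
])
pipeline.fit(data, labels)
preds = pipeline.predict(data)
print(preds)
```

MultinomialNB requires non-negative inputs, and unscaled values such as years may dominate the TF-IDF weights, so scaling the numerical columns or trying a different classifier may be worth considering.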