3
votes

I am doing a binary classification task using a linear SVM in scikit-learn. I use nominal features and word vectors. I obtained the word vectors from the pretrained Google word2vec model; however, I am not sure how an SVM can handle word vectors as a feature.
It seems that I need to "split" each vector into 300 separate features (= 300 vector dimensions), because I can't pass the vector as a whole to the SVM. But that doesn't seem right, as the vector should be treated as one feature.
What would be the correct way to represent a vector in this case?

1 Answer

1
votes

Vector of many features

From the perspective of an SVM, each dimension of a word vector is a separate numeric feature: each of the 300 values measures something different, so the vector is fed to the classifier as 300 feature columns rather than as a single feature.

The same applies to non-SVM classifiers. For example, if you had a neural network whose input features were that word vector of length 300 and (for the sake of a crude example) a bit stating whether the word was capitalized, you would concatenate them and have 301 numbers as your input; the extra bit is treated as just one more feature, exactly like each of the 300 dimensions.
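
To make the concatenation concrete, here is a minimal sketch in scikit-learn. Everything in it is an assumption for illustration: the word vectors are random stand-ins for real word2vec lookups, and the nominal feature (`pos_tags`), the capitalization bit, and the labels are made up. It only shows the shape of the feature matrix, not a tuned model.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, dim = 40, 300

# Stand-in for pretrained word2vec vectors: one 300-dimensional row per sample.
# In practice each row would be looked up from the pretrained model.
word_vectors = rng.normal(size=(n_samples, dim))

# Hypothetical nominal feature (a part-of-speech tag) and a capitalization bit.
pos_tags = np.array(["NOUN", "VERB", "ADJ", "NOUN"] * 10).reshape(-1, 1)
is_capitalized = (np.arange(n_samples) % 3 == 0).astype(int).reshape(-1, 1)

# Nominal features have no numeric order, so one-hot encode them.
pos_onehot = OneHotEncoder().fit_transform(pos_tags).toarray()

# Concatenate column-wise: 300 vector dimensions + 1 bit + 3 one-hot columns.
X = np.hstack([word_vectors, is_capitalized, pos_onehot])

# Made-up binary labels, alternating so that both classes are present.
y = np.arange(n_samples) % 2

clf = LinearSVC().fit(X, y)
print(X.shape)          # one row per sample, one column per feature
print(clf.coef_.shape)  # one learned weight per feature column
```

The classifier learns one weight per column, so it makes no distinction between a column that came from the word vector and one that came from a nominal feature. If the nominal columns and the vector dimensions are on very different scales, scaling the features before fitting may help, but that is a separate choice from the representation itself.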