I am doing a binary classification task using a linear SVM in scikit-learn. I use nominal features and word vectors. I obtained the word vectors using the pretrained Google word2vec model; however, I am not sure how an SVM can handle word vectors as a feature.
It seems that I need to "split" each vector into 300 separate features (= 300 vector dimensions), because I can't pass the vector as a whole to the SVM. But that doesn't seem right, as the vector should be treated as one feature.
What would be the correct way to represent a vector in this case?
3
votes
1 Answer
1
votes
Vector of many features
From the perspective of an SVM, each dimension of a word vector is a separate numeric feature: each dimension is a numeric measurement of something different, so splitting the vector into 300 columns is exactly the right thing to do.
The same applies to non-SVM classifiers. For example, if you had a neural network and your input features were that word vector of length 300 plus (for the sake of a crude example) a bit stating whether the word was capitalized, you would concatenate those into 301 numbers as your input; the capitalization bit would be treated just like each of the 300 vector dimensions.
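As a minimal sketch of that concatenation for the SVM case: random arrays stand in for the pretrained word2vec lookups and the nominal feature here (hypothetical data, not the asker's actual dataset), and `numpy.hstack` builds the combined 301-column feature matrix that `LinearSVC` consumes.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in for pretrained word2vec lookups (hypothetical data):
# each row is the 300-dimensional vector of one training example.
word_vectors = rng.normal(size=(20, 300))

# A nominal feature, e.g. "is the word capitalized?",
# encoded as a single 0/1 column.
is_capitalized = rng.integers(0, 2, size=(20, 1))

# Concatenate: each example becomes one row of 301 numeric features.
X = np.hstack([word_vectors, is_capitalized])
y = np.array([0, 1] * 10)  # toy binary labels

clf = LinearSVC().fit(X, y)
print(X.shape)               # (20, 301)
print(clf.predict(X).shape)  # (20,)
```

Real code would look up each word's vector in the word2vec model and one-hot encode multi-valued nominal features (e.g. with scikit-learn's `OneHotEncoder`) before stacking.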