How to normalize tf-idf vectors for SVMs?

Question

I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents, each with a tf-idf vector of size N, giving an M * N matrix.

M is just 10 documents and each tf-idf vector has 1000 words, so I have far more features than documents. Also, each word occurs in only 2 or 3 documents. When I normalize each feature (word), i.e. column normalization to [0,1], with

val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)

It either gives me 0, 1 of course.

And it gives me bad results. I am using libsvm with the RBF kernel, C = 0.0312, gamma = 0.007815.

Any recommendations?

Should I include more documents? Or try other kernels like sigmoid, or better normalization methods?

lejlot · Accepted Answer · 2013-08-14T09:57:07

The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to face the problem itself. There are dozens of great books (e.g. Haykin's "Neural Networks and Learning Machines") as well as online courses that will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning

Getting back to the problem itself:

10 documents is orders of magnitude too small to get any significant results or insights into the problem.
There is no universal method of data preprocessing; you have to find a suitable one through numerous tests and data analysis.
SVMs have hyperparameters that must be tuned: you cannot use a single pair of C and gamma values and expect reasonable results. You have to check dozens of them to even get a clue "where to search". The simplest method for doing so is the so-called grid search.
1000 features is a large number of dimensions, which suggests that using a kernel that implies an infinite-dimensional feature space is quite... redundant. It would be a better idea to first try simpler ones, which have a smaller chance of overfitting (linear or low-degree polynomial).
Is tf*idf a good choice if "each word occurs in 2 or 3 documents"? That is doubtful, unless what you actually mean is 20-30% of documents.
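
The grid search mentioned above can be sketched roughly as follows. This is only an illustration, not a libsvm API: `log2_grid`, `grid_search` and `toy_score` are hypothetical names, and the exponent ranges are an assumed coarse starting grid. In real use, the scoring function would train an SVM with the given C and gamma and return its cross-validated accuracy; here a toy function stands in for it so the sketch runs on its own.

```python
import itertools
import math


def log2_grid(start_exp, stop_exp, step):
    """Return powers of two from 2**start_exp up to and including 2**stop_exp."""
    return [2.0 ** e for e in range(start_exp, stop_exp + 1, step)]


def grid_search(score_fn, c_values, gamma_values):
    """Score every (C, gamma) pair and return (best_score, best_C, best_gamma)."""
    best = None
    for c, gamma in itertools.product(c_values, gamma_values):
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Placeholder for cross-validated accuracy (assumption, not real data).

    It is constructed to peak at C = 2**3 and gamma = 2**-5.
    """
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 5)


# Assumed coarse grid: exponentially spaced values for both parameters.
c_values = log2_grid(-5, 15, 2)      # 2**-5, 2**-3, ..., 2**15
gamma_values = log2_grid(-15, 3, 2)  # 2**-15, 2**-13, ..., 2**3

best_score, best_c, best_gamma = grid_search(toy_score, c_values, gamma_values)
print("best C:", best_c, "best gamma:", best_gamma)
```

Exponential spacing covers several orders of magnitude with about a hundred evaluations; once a promising region is found, a finer grid around it can be searched the same way. For a linear kernel the same loop applies with C alone.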

Finally, regarding the simple feature squashing, you wrote:

It either gives me 0, 1 of course.

It should result in values in the [0,1] interval, not just its limits. So if this is the case, you probably have an error in your implementation.
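
As a quick sanity check of the squashing formula from the question, here is a minimal sketch in plain Python. The column values are made up for illustration (one word that occurs in 3 of 10 documents with different weights), and `min_max_scale_column` is a hypothetical helper name, not part of libsvm.

```python
def min_max_scale_column(values):
    """Rescale one feature column to [0, 1] using (v - min) / (max - min).

    A constant column has max == min, so it is mapped to all zeros
    to avoid dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical tf-idf weights of a single word across 10 documents.
column = [0.0, 0.0, 0.2, 0.0, 0.0, 0.8, 0.0, 0.0, 0.4, 0.0]
scaled = min_max_scale_column(column)

# Collect the distinct scaled values to see whether anything falls
# strictly between the limits 0 and 1.
distinct = sorted(set(scaled))
print(distinct)
```

If the nonzero weights in a column differ, as they do here, the smaller ones land strictly between 0 and 1. Getting only 0 and 1 would mean that all nonzero weights in every column are equal, which is worth checking against your tf-idf computation.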
