1
votes

I am doing a simple binary classification, and here is an example of the problem I have. Let's say we have n documents (Doc 1, Doc 2, ..., Doc n). We are going to use TF-IDF values as features to train a binary classifier on a bag-of-words representation. We have m features for our training files (m is, technically, the number of unique tokens in all of these n documents after cleaning and pre-processing).

Now, let's say we have a trained model and we are going to predict the label of a new document. We should first pre-process the test document the same way we did for our training documents, and then use TF-IDF to build a feature vector for it. There are two problems here:

  • The number of features is not going to be the same for the training and test sets. I have read some solutions for this, but from a scientific point of view none of them satisfied me.
  • It does not really make sense to calculate TF-IDF for only one test document, or even a couple of them, because the dictionary of tokens in the training and test sets is not necessarily the same; even if the two sets happen to have the same number of features, those features are not necessarily the same ones.

So now I am just trying to figure out how exactly we can label a new document using a model that we trained with a bag-of-words representation and TF-IDF values. In particular, I am looking for a reasonable answer to the two specific problems mentioned above.

We can calculate the accuracy of the model (for example, using cross-validation), but I do not know what we should do when labeling a new document.

P.S. I am using scikit-learn and python.

UPDATE: I found the answer to my question. In such cases, we can simply reuse the same TfidfVectorizer that we used when training our classifier. So now, each time I train a new classifier and build my feature vectors with TfidfVectorizer, I save the fitted vectorizer to a file using pickle, and I load this vectorizer when creating the feature vectors for the test set. A minimal sketch of what I mean is below.
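A minimal sketch of that workflow, assuming toy documents, placeholder file names, and a LogisticRegression classifier just for illustration:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["first training document", "second training document"]
train_labels = [0, 1]

# Fit the vectorizer on the training documents only: this learns the
# vocabulary and the IDF weights.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = LogisticRegression().fit(X_train, train_labels)

# Persist the fitted vectorizer (and the classifier) for later use.
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# At prediction time: load both objects and call transform() only,
# never fit the vectorizer again on the new data.
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("classifier.pkl", "rb") as f:
    clf = pickle.load(f)

X_new = vectorizer.transform(["a brand new document"])
print(clf.predict(X_new))
```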

2
Think about how you would proceed in a real-world scenario, where you have trained with all available data, but new text data still arrives at prediction time. – Vivek Kumar

2 Answers

1
votes

You should figure out all possible features and their IDF weights during training; at test time you use the features and weights found on the training dataset. Don't compute IDF on the test documents.

1) When using the bag-of-words approach, the common way is to discard words not seen during training. If you haven't seen a word during training, you have zero information about it, so it does not affect the prediction result.

2) Yes, it doesn't make sense to build a vocabulary and compute IDF weights at prediction time. Use the features and weights found at the training stage (see the short sketch below).
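A short sketch of that fit/transform split in scikit-learn (the documents are made up): `fit_transform()` learns the vocabulary and IDF weights from the training data, and `transform()` reuses them on new text, silently dropping words that were never seen during training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "the dog ate my homework"]
test_docs = ["the cat chased a llama"]  # "chased", "a", "llama" were never seen

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # vocabulary + IDF come from training only
X_test = vectorizer.transform(test_docs)        # unseen words are simply ignored

# Both matrices have the same number of columns: len(vectorizer.vocabulary_).
print(X_train.shape, X_test.shape)
```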

scikit-learn provides a tutorial which covers this.

It could make sense to fit tf*idf on a dataset larger than the training dataset, to get more precise IDF estimates for words found in the training data, but I'm not sure how often people do this.
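As a rough illustration of that idea (the extra unlabeled corpus here is hypothetical): the IDF statistics are estimated on a larger pool of text, while the classifier is still trained only on the labeled training documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_docs = ["spam spam spam", "meeting at noon"]
labels = [1, 0]
extra_unlabeled_docs = ["lots of additional raw text", "more unlabeled documents"]

# IDF weights are estimated on the larger pool of text ...
vectorizer = TfidfVectorizer().fit(labeled_docs + extra_unlabeled_docs)

# ... but the classifier only ever sees the labeled training data.
X_train = vectorizer.transform(labeled_docs)
clf = LogisticRegression().fit(X_train, labels)
```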

-3
votes

I cannot back this up scientifically, but you could try using the dictionary of m features and calculating the TF-IDF scores for those features on the test set. This will create a vector for each test document that is the same size as your training vectors and corresponds to the same features used while training your model. You will still have to deal with words in the training set that do not show up in the test set, though.
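A rough sketch of what this suggestion could look like in scikit-learn, using the `vocabulary` parameter to pin the test vectorizer to the training dictionary (note that the IDF weights are then re-estimated from the test documents themselves, which is what the other answer advises against):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat barked loudly"]

train_vectorizer = TfidfVectorizer().fit(train_docs)

# A new vectorizer restricted to the m training features; TF-IDF is
# recomputed on the test documents for exactly those features.
test_vectorizer = TfidfVectorizer(vocabulary=train_vectorizer.vocabulary_)
X_test = test_vectorizer.fit_transform(test_docs)

print(X_test.shape)  # same number of columns as the training feature matrix
```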

Can I ask why you are using TF-IDF and not something like Naive Bayes or Random Forests?