I am doing a simple binary classification, and here is an example of the problem I have. Let's say we have n documents (Doc 1, Doc 2, ..., Doc n). We are going to use TF-IDF values as bag-of-words features to train a binary classifier. We have m features for our training documents (technically, m is the number of unique tokens across all n documents after cleaning and pre-processing).
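For concreteness, here is a minimal sketch of that training setup in scikit-learn; the toy corpus is just a hypothetical stand-in for the n documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical corpus of n pre-processed training documents
train_docs = ["first cleaned document", "second cleaned document", "third one"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # sparse matrix of shape (n, m)

print(X_train.shape)                # (n, m)
print(len(vectorizer.vocabulary_))  # m = number of unique tokens in the corpus
```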
Now, let's say we have a trained model and want to predict the label of a new document. We should first pre-process the test document the same way we did the training documents, and then use TF-IDF to build a feature vector for it. There are two problems here:
- The number of features is not going to be the same for the training and testing sets. I have read some solutions for this, but none of them satisfied me from a scientific point of view.
- It does not really make sense to calculate TF-IDF for only one test document, or even a couple of them, because the dictionary of tokens in the training and testing sets is not necessarily the same; even if the two sets had the same number of features, those features would not necessarily be the same tokens.
So now I am just trying to figure out how exactly we can label a new document using a model trained on bag-of-words features with TF-IDF values. In particular, I am looking for a reasonable answer to the two specific problems mentioned above.
We can calculate the accuracy of the model (for example, using cross-validation), but I do not know what we should do to label a new document.
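For reference, here is one way to get a cross-validated accuracy estimate in scikit-learn; the toy corpus, labels, and the choice of LogisticRegression are placeholder assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# hypothetical labeled training corpus
train_docs = ["good film", "bad film", "great movie",
              "terrible movie", "good plot", "bad plot"]
train_labels = [1, 0, 1, 0, 1, 0]

X_train = TfidfVectorizer().fit_transform(train_docs)

# 3-fold cross-validated accuracy of the classifier on the TF-IDF features
scores = cross_val_score(LogisticRegression(), X_train, train_labels,
                         cv=3, scoring="accuracy")
print(scores.mean())
```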
P.S. I am using scikit-learn and Python.
UPDATE: I found the answer to my question. In such cases, we can simply reuse the same fitted TfidfVectorizer that we used when training the classifier. So now, each time I train a new classifier and build my feature vectors with TfidfVectorizer, I save the fitted vectorizer to a file using pickle, and I load that same vectorizer when creating the feature vectors for the test set.
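Here is a minimal end-to-end sketch of that workflow; the file names, the toy corpus, and the LogisticRegression classifier are assumptions for illustration, not part of my original setup:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# hypothetical labeled training corpus
train_docs = ["good film", "bad film", "great movie", "terrible movie"]
train_labels = [1, 0, 1, 0]

# fit the vectorizer and the classifier on the training corpus only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
clf = LogisticRegression().fit(X_train, train_labels)

# persist the fitted vectorizer (and classifier) with pickle
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# at prediction time: load both and call transform(), NOT fit_transform(),
# so the new document is mapped onto the training vocabulary
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("classifier.pkl", "rb") as f:
    clf = pickle.load(f)

X_new = vectorizer.transform(["a pretty good new film"])  # shape (1, m)
print(clf.predict(X_new))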