For TF-IDF feature extraction, scikit-learn has two classes: TfidfTransformer and TfidfVectorizer. Both classes essentially serve the same purpose but are meant to be used differently. For textual feature extraction, scikit-learn has the notion of Transformers and Vectorizers. A Vectorizer works directly on the raw text to generate the features, whereas a Transformer works on existing features and transforms them into new features. By that analogy, TfidfTransformer works on existing term-frequency features and converts them to TF-IDF features, whereas TfidfVectorizer takes the raw text as input and directly generates the TF-IDF features. You should always use TfidfVectorizer if, at the time of feature building, you do not already have a Document-Term Matrix. At a black-box level, you should think of TfidfVectorizer as a CountVectorizer followed by a TfidfTransformer.
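To make that equivalence concrete, here is a minimal sketch; the tiny two-document corpus and the variable names are only illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ['jack and jill went up the hill', 'to fetch a pail of water']

# Route 1: raw text -> term counts -> TF-IDF
counts = CountVectorizer().fit_transform(docs)                 # Document-Term Matrix
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# Route 2: raw text -> TF-IDF directly
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# With default parameters, both routes yield the same sparse matrix.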
Now for a working example of TfidfVectorizer. Note that if this example is clear, you will have no difficulty understanding the example given for TfidfTransformer.
Now consider you have the following 4 documents in your corpus:
text = [
'jack and jill went up the hill',
'to fetch a pail of water',
'jack fell down and broke his crown',
'and jill came tumbling after'
]
You can use any iterable as long as it iterates over strings. TfidfVectorizer also supports reading text from files, which is covered in detail in the docs. In the simplest case, we can initialize a TfidfVectorizer object and fit our training data to it. This is done as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
train_features = tfidf.fit_transform(text)  # learn the vocabulary and IDF weights, then vectorize
train_features.shape                        # (4, 20)
This code simply fits the Vectorizer on our input data and generates a sparse matrix of dimensions 4 x 20. Hence it transforms each document in the given text into a vector of 20 features, where the size of the vocabulary is 20.
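If you want to see which tokens those 20 features correspond to, you can inspect the fitted vocabulary (get_feature_names_out is available in recent scikit-learn versions; older releases expose get_feature_names instead):

print(tfidf.get_feature_names_out())   # the 20 tokens, e.g. 'after', 'and', 'broke', ...
print(len(tfidf.vocabulary_))          # 20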
In the case of TfidfVectorizer, 'fitting the model' means that the TfidfVectorizer learns the IDF weights from the corpus. 'Transforming the data' means using the fitted model (the learnt IDF weights) to convert documents into TF-IDF vectors. This terminology is standard throughout scikit-learn, and it is extremely useful for classification problems. Suppose you want to classify documents as positive or negative based on some labelled training data, using TF-IDF vectors as features. In that case you build your TF-IDF vectorizer on your training data, and when you see new test documents, you simply transform them using the already fitted TfidfVectorizer.
So if we had the following test_text:
test_text = [
'jack fetch water',
'jill fell down the hill'
]
we would build the test features by simply doing:
test_data = tfidf.transform(test_text)  # transform only, no fitting
This will again give us a sparse matrix, this time of dimensions 2 x 20. The IDF weights used in this case are the ones learnt from the training data.
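To round off the classification scenario mentioned earlier, here is a hedged sketch of how these features would typically be used; the labels and the choice of LogisticRegression are illustrative assumptions, not part of the original example:

from sklearn.linear_model import LogisticRegression

train_labels = [1, 0, 1, 0]               # hypothetical positive/negative labels for the 4 documents
clf = LogisticRegression()
clf.fit(train_features, train_labels)     # train on the TF-IDF features of the training corpus
predictions = clf.predict(test_data)      # the test documents were transformed with the same vectorizer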
This is how a simple TfidfVectorizer works. You can make it more intricate by passing more parameters to the constructor; these are very well documented in the scikit-learn docs. Some of the parameters that I use frequently are:

ngram_range - This allows us to build TF-IDF vectors using n-gram tokens. For example, passing (1, 2) builds both unigrams and bigrams.

stop_words - Allows us to supply stop words to ignore in the process. It is common practice to filter out words such as 'the', 'of', etc., which are present in almost all documents.

min_df and max_df - These allow us to dynamically filter the vocabulary based on document frequency. For example, by giving a max_df of 0.7, I can let my application automatically remove domain-specific stop words. For instance, in a corpus of medical journals, the word 'disease' can be thought of as a stop word.
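Putting those parameters together, here is a minimal sketch; the specific values are only illustrative starting points, and pruning settings like min_df=2 only make real sense on corpora larger than the toy example above:

tfidf_custom = TfidfVectorizer(
    ngram_range=(1, 2),       # build unigrams and bigrams
    stop_words='english',     # drop common English stop words
    min_df=2,                 # ignore tokens appearing in fewer than 2 documents
    max_df=0.7,               # ignore tokens appearing in more than 70% of documents
)
custom_features = tfidf_custom.fit_transform(text)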
Beyond this, you can also refer to some sample code that I wrote for a project. Though it is not well documented, the functions are very well named.
Hope this helps!