For TF-IDF feature extraction, scikit-learn has two classes: TfidfTransformer and TfidfVectorizer. Both classes essentially serve the same purpose but are meant to be used differently. For textual feature extraction, scikit-learn has the notion of Transformers and Vectorizers. A Vectorizer works directly on the raw text to generate the features, whereas a Transformer works on existing features and transforms them into new features. By that analogy, TfidfTransformer works on existing term-frequency features and converts them to TF-IDF features, whereas TfidfVectorizer takes the raw text as input and directly generates the TF-IDF features. You should always use TfidfVectorizer if, at the time of feature building, you do not already have a Document-Term Matrix. At a black-box level, you should think of TfidfVectorizer as a CountVectorizer followed by a TfidfTransformer.
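To make that equivalence concrete, here is a minimal sketch; the tiny two-document corpus and the variable names are only illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ['jack and jill went up the hill', 'to fetch a pail of water']

# Route 1: raw text -> term counts -> TF-IDF
counts = CountVectorizer().fit_transform(docs)                 # Document-Term Matrix
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# Route 2: raw text -> TF-IDF directly
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# With default parameters, both routes yield the same sparse matrix.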
Now for a working example of TfidfVectorizer. Note that if this example is clear, you will have no difficulty understanding the example given for TfidfTransformer.
Now consider you have the following 4 documents in your corpus:
text = [
'jack and jill went up the hill',
'to fetch a pail of water',
'jack fell down and broke his crown',
'and jill came tumbling after'
]
You can use any iterable as long as it iterates over strings. TfidfVectorizer also supports reading text from files, which is covered in detail in the docs. In the simplest case, we can initialize a TfidfVectorizer object and fit our training data to it. This is done as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
train_features = tfidf.fit_transform(text)  # learn the vocabulary and IDF weights, then vectorize
train_features.shape                        # (4, 20)
This code simply fits the Vectorizer on our input data and generates a sparse matrix of dimensions 4 x 20. Hence it transforms each document in the given text into a vector of 20 features, where the size of the vocabulary is 20.
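If you want to see which tokens those 20 features correspond to, you can inspect the fitted vocabulary (get_feature_names_out is available in recent scikit-learn versions; older releases expose get_feature_names instead):

print(tfidf.get_feature_names_out())   # the 20 tokens, e.g. 'after', 'and', 'broke', ...
print(len(tfidf.vocabulary_))          # 20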
In the case of TfidfVectorizer, 'fitting the model' means that the TfidfVectorizer learns the IDF weights from the corpus. 'Transforming the data' means using the fitted model (the learnt IDF weights) to convert documents into TF-IDF vectors. This terminology is standard throughout scikit-learn, and it is extremely useful for classification problems. Suppose you want to classify documents as positive or negative based on some labelled training data, using TF-IDF vectors as features. In that case you build your TF-IDF vectorizer on your training data, and when you see new test documents, you simply transform them using the already fitted TfidfVectorizer.
So if we had the following test_text:
test_text = [
'jack fetch water',
'jill fell down the hill'
]
we would build the test features by simply doing:
test_data = tfidf.transform(test_text)  # transform only, no fitting
This will again give us a sparse matrix, this time of dimensions 2 x 20. The IDF weights used in this case are the ones learnt from the training data.
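To round off the classification scenario mentioned earlier, here is a hedged sketch of how these features would typically be used; the labels and the choice of LogisticRegression are illustrative assumptions, not part of the original example:

from sklearn.linear_model import LogisticRegression

train_labels = [1, 0, 1, 0]               # hypothetical positive/negative labels for the 4 documents
clf = LogisticRegression()
clf.fit(train_features, train_labels)     # train on the TF-IDF features of the training corpus
predictions = clf.predict(test_data)      # the test documents were transformed with the same vectorizer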
This is how a simple TfidfVectorizer works. You can make it more intricate by passing more parameters to the constructor; these are very well documented in the scikit-learn docs. Some of the parameters that I use frequently are:

ngram_range - This allows us to build TF-IDF vectors using n-gram tokens. For example, passing (1, 2) builds both unigrams and bigrams.

stop_words - Allows us to supply stop words to ignore in the process. It is common practice to filter out words such as 'the', 'of', etc., which are present in almost all documents.

min_df and max_df - These allow us to dynamically filter the vocabulary based on document frequency. For example, by giving a max_df of 0.7, I can let my application automatically remove domain-specific stop words. For instance, in a corpus of medical journals, the word 'disease' can be thought of as a stop word.
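Putting those parameters together, here is a minimal sketch; the specific values are only illustrative starting points, and pruning settings like min_df=2 only make real sense on corpora larger than the toy example above:

tfidf_custom = TfidfVectorizer(
    ngram_range=(1, 2),       # build unigrams and bigrams
    stop_words='english',     # drop common English stop words
    min_df=2,                 # ignore tokens appearing in fewer than 2 documents
    max_df=0.7,               # ignore tokens appearing in more than 70% of documents
)
custom_features = tfidf_custom.fit_transform(text)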
Beyond this, you can also refer to some sample code that I wrote for a project. Though it is not well documented, the functions are very well named.
Hope this helps!