I am working on the Enron dataset to classify emails, using Python 3. I have pre-processed the data (tokenizing, removing stop words, stemming) and am currently working on representing the data in transactional and data-matrix form. This is my understanding of the process:
- Compute tf-idf for every word in every document.
- Sort the words by tf-idf score.
- Take the top "k" words by score.
- Iterate through the corpus and intersect the top "k" words with the words in each document. Listing the top "k" words found in each document gives the data in transactional form.
- Recording the presence/absence (1/0) of the top "k" words in each document gives the data in data-matrix form.
Consider the following 3 documents:
- doc1: The quick fox jumped over the quick dog;
- doc2: The quick fox jumped;
- doc3: The dog was lazy;
tf-idf calculation (tf is the raw count, idf uses log base 10):
tf("quick", doc1) = 2;
tf("quick", doc2) = 1;
idf("quick") = log(3/2) = 0.176;
tfidf("quick", doc1) = 2*0.176 = 0.352;
tfidf("quick", doc2) = 1*0.176 = 0.176;
tf("lazy", doc3) = 1;
idf("lazy") = log(3/1) = 0.477;
tfidf("lazy", doc3) = 1*0.477 = 0.477;
tf("fox", doc1) = 1;
tf("fox", doc2) = 1;
idf("fox") = log(3/2) = 0.176;
tfidf("fox", doc1) = 1*0.176 = 0.176;
tfidf("fox", doc2) = 1*0.176 = 0.176;
tf("dog", doc1) = 1;
tf("dog", doc3) = 1;
idf("dog") = log(3/2) = 0.176;
tfidf("dog", doc1) = 1*0.176 = 0.176;
tfidf("dog", doc3) = 1*0.176 = 0.176;
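The hand calculation above can be reproduced with a minimal sketch, assuming raw counts for tf and a base-10 logarithm for idf (which is what gives 0.176 and 0.477). The token lists are an assumption: they are the preprocessed words as listed in the data-matrix form further down in this question, and `tf_idf` is a hypothetical helper name, not a library function.

```python
import math
from collections import Counter

# Assumed token lists after preprocessing (stop words removed, "jumped" -> "jump").
docs = {
    "doc1": ["quick", "fox", "jump", "over", "quick", "dog"],
    "doc2": ["quick", "fox", "jump"],
    "doc3": ["dog", "lazy"],
}

def tf_idf(docs):
    """Return {(term, doc_id): score} with tf = raw count, idf = log10(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        for term, count in Counter(tokens).items():
            scores[(term, doc_id)] = count * math.log10(n_docs / df[term])
    return scores

scores = tf_idf(docs)
print(round(scores[("quick", "doc1")], 3))  # 0.352
print(round(scores[("quick", "doc2")], 3))  # 0.176
print(round(scores[("lazy", "doc3")], 3))   # 0.477
```

Library implementations may use a different log base, smoothing, or normalization, so their numbers will generally not match these hand-computed values.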
So, if the above words were sorted by score in descending order, their ranking would be:
lazy (0.477), quick (0.352), quick (0.176), fox (0.176), fox (0.176), dog (0.176), dog (0.176).
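To make the two possible readings of "top k" concrete, here is a small sketch that ranks the scores above in two ways: per (word, document) pair, and per distinct word using its best score in the corpus. The second ranking rule (taking the maximum per word) is only one assumption about how a corpus-level score could be defined; the variable names are illustrative.

```python
# Scores copied from the hand calculation: (term, document) -> tf-idf.
scores = {
    ("quick", "doc1"): 0.352, ("quick", "doc2"): 0.176,
    ("lazy", "doc3"): 0.477,
    ("fox", "doc1"): 0.176, ("fox", "doc2"): 0.176,
    ("dog", "doc1"): 0.176, ("dog", "doc3"): 0.176,
}

# Option A: rank every (term, document) pair, so a term can appear more than once.
# sorted() is stable, so ties keep the order in which they were inserted above.
by_pair = sorted(scores.items(), key=lambda item: item[1], reverse=True)
top4_pairs = [term for (term, _doc), _score in by_pair][:4]
print(top4_pairs)  # ['lazy', 'quick', 'quick', 'fox']

# Option B: rank distinct terms by their highest score anywhere in the corpus.
best = {}
for (term, _doc), score in scores.items():
    best[term] = max(score, best.get(term, 0.0))
top4_terms = sorted(best, key=best.get, reverse=True)[:4]
print(top4_terms)  # ['lazy', 'quick', 'fox', 'dog']
```

Note that the 0.176 ties are broken only by insertion order here, so which of fox and dog makes the cut in Option A is arbitrary.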
Questions:
- Based on the above calculation, what are the top 4 words? Should they be the top words for the overall corpus, or the top words within each document?
- Is the sorting of the words correct?
- Suppose the top 4 words are: lazy, quick, quick, fox;
transactional form is:
- doc1: quick, fox, quick
- doc2: quick, fox
- doc3: lazy
data-matrix form is:
- doc1: 1,1,0,0,1,0 (quick, fox, jump, over, quick, dog)
- doc2: 1,1,0 (quick, fox, jump)
- doc3: 0,1 (dog, lazy)
The above forms would change if the top 4 words were instead: lazy, quick, fox, dog. Is my understanding correct?
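For comparison, here is a minimal sketch of one way the two forms could be built for the second case (lazy, quick, fox, dog). It assumes the top-k list is deduplicated and that every document shares the same columns, one per top-k word, which differs from the per-document columns I listed above. The token lists are the assumed preprocessed words from this question.

```python
# Assumed token lists after preprocessing (taken from the listing in this question).
docs = {
    "doc1": ["quick", "fox", "jump", "over", "quick", "dog"],
    "doc2": ["quick", "fox", "jump"],
    "doc3": ["dog", "lazy"],
}

# Assumed deduplicated top-k vocabulary; it also fixes the column order.
top_k = ["lazy", "quick", "fox", "dog"]

# Transactional form: the top-k words that occur in each document.
transactions = {
    doc_id: [term for term in top_k if term in tokens]
    for doc_id, tokens in docs.items()
}

# Data-matrix form: one row per document, one 0/1 column per top-k word.
matrix = {
    doc_id: [1 if term in tokens else 0 for term in top_k]
    for doc_id, tokens in docs.items()
}

for doc_id in docs:
    print(doc_id, transactions[doc_id], matrix[doc_id])
# doc1 ['quick', 'fox', 'dog'] [0, 1, 1, 1]
# doc2 ['quick', 'fox'] [0, 1, 1, 0]
# doc3 ['lazy', 'dog'] [1, 0, 0, 1]
```

With a shared column order, every row has the same length, so the rows can be stacked into a single matrix.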