
I am working on the Enron dataset to classify emails, using Python 3. I have pre-processed the data (tokenizing, removing stop words, stemming) and am currently working on representing it in transactional and data-matrix format. This is my understanding of the process:

  1. Find the tf-idf score of every word in every document.
  2. Sort the words by tf-idf score.
  3. Take the top "k" words by score.
  4. Iterate through the corpus and intersect the top "k" words with the words in every document; printing the top "k" words present in each document gives the data in transactional form.
  5. Recording the presence/absence (1/0) of each top "k" word in each document gives the data in data-matrix form.
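A minimal Python sketch of these five steps (the token lists and k = 4 are hypothetical placeholders for an already pre-processed corpus; log base 10 is used to match the calculations below, and deduplicating the ranked words is one judgment call for handling repeats like "quick"):

    import math
    from collections import Counter

    # Hypothetical tokenized corpus (stop words removed, stemmed)
    docs = [
        ["quick", "fox", "jump", "quick", "dog"],
        ["quick", "fox", "jump"],
        ["dog", "lazy"],
    ]

    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency

    # Step 1: tf-idf for every (word, document) pair
    tfidf = {}
    for i, doc in enumerate(docs):
        for w, f in Counter(doc).items():
            tfidf[(w, i)] = f * math.log10(n_docs / df[w])

    # Steps 2-3: sort by score, keep the top k distinct words
    k = 4
    ranked = sorted(tfidf.items(), key=lambda item: item[1], reverse=True)
    top_k = []
    for (w, _), _ in ranked:
        if w not in top_k:
            top_k.append(w)
        if len(top_k) == k:
            break

    # Step 4: transactional form - top-k words occurring in each document
    transactional = [[w for w in doc if w in top_k] for doc in docs]

    # Step 5: data-matrix form - presence/absence (1/0) of each top-k word
    matrix = [[1 if w in doc else 0 for w in top_k] for doc in docs]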

Consider the following 3 documents:

  • doc1: The quick fox jumped over the quick dog;
  • doc2: The quick fox jumped;
  • doc3: The dog was lazy;

tf-idf calculation:

tf("quick", doc1) = 2; 
tf("quick", doc2) = 1; 
idf("quick") = log(3/2) = 0.176; 
tfidf("quick", doc1) = 2*0.176 = 0.352; 
tfidf("quick", doc2) = 1*0.176 = 0.176; 

tf("lazy", doc3) = 1;
idf("lazy") = log(3/1) = 0.477;
tfidf("lazy", doc3) = 1*0.477 = 0.477;

tf("fox", doc1) = 1; 
tf("fox", doc2) = 1; 
idf("fox") = log(3/2) = 0.176; 
tfidf("fox", doc1) = 1*0.176 = 0.176; 
tfidf("fox", doc2) = 1*0.176 = 0.176; 

tf("dog", doc1) = 1; 
tf("dog", doc3) = 1; 
idf("dog") = log(3/2) = 0.176; 
tfidf("dog", doc1) = 1*0.176 = 0.176; 
tfidf("dog", doc3) = 1*0.176 = 0.176; 

So, if the above words were to be sorted, their rank would be as follows:

lazy (0.477), quick (0.352), quick (0.176), fox (0.176), fox (0.176), dog (0.176), dog (0.176).
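These values can be checked with a couple of lines of Python (base-10 log, matching the numbers above):

    import math

    idf_quick = math.log10(3 / 2)   # 0.176: "quick" appears in 2 of 3 docs
    print(2 * idf_quick)            # tfidf("quick", doc1) = 0.352
    print(1 * idf_quick)            # tfidf("quick", doc2) = 0.176
    print(1 * math.log10(3 / 1))    # tfidf("lazy", doc3)  = 0.477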

Questions:

  1. Based on the above calculation, what are the top 4 words? Are they for the overall corpus, or the top words in every document?
  2. Is the sorting of the words correct?
  3. Suppose the top 4 words are: lazy, quick, quick, fox. The transactional form is:
doc1: quick, fox, quick 
doc2: quick, fox
doc3: lazy

The data-matrix form is:
doc1: 1,1,0,0,1,0 (quick, fox, jump, over, quick, dog) 
doc2: 1,1,0 (quick, fox, jump) 
doc3: 0,1 (dog, lazy)

The above forms will change if the top 4 words were instead: lazy, quick, fox, dog. Is my understanding correct?


2 Answers


For your first question: since tf-idf is used to rank the relevance of documents to search strings, you would really be looking for the "top documents", i.e. the documents where your search words rank the highest overall. So you need to turn your calculation on its head and compute the rank of each document. After that, chances are you will not even need to worry about your second and third questions, because the documents' ranks will likely differ and you can just take the highest-ranked one as the answer. And, not to forget, you need a starting string against which to evaluate the documents.
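One simple way to realize this idea is to score each document by the summed tf-idf of the query terms; the sketch below (the function name and the summed-score heuristic are my own illustration, not something prescribed by this answer) returns document indices from highest to lowest score:

    import math
    from collections import Counter

    def rank_documents(query_terms, docs):
        """Rank tokenized documents by the summed tf-idf of the query terms."""
        n = len(docs)
        df = Counter(w for doc in docs for w in set(doc))
        scores = []
        for doc in docs:
            tf = Counter(doc)
            scores.append(sum(tf[w] * math.log10(n / df[w])
                              for w in query_terms if df[w]))
        return sorted(range(n), key=lambda i: scores[i], reverse=True)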

1
votes
  1. Based on the above calculation, what are the top 4 words? Are they for the overall corpus, or the top words in every document?

When you select the top k words, they become the controlled vocabulary (a text-mining term) for your corpus. I encourage you to go through this tutorial. A few important points:

  • When you select the top k words from the entire corpus, you are actually considering ttf-idf, where ttf means total term frequency. When you consider one single document and compute a term's frequency, we call it TF. When we do the same over the whole corpus, it becomes TTF.

For your example:

Unique words are: The, quick, fox, jumped, over, the, dog, was, lazy

I encourage you, before you pre-process your data, to convert it to either upper or lower case. Then "The" and "the" will be the same!

If you do that, then the unique words are: the, quick, fox, jumped, over, dog, was, lazy

Total unique words: 8

Term frequencies for each unique word, as (doc1, doc2, doc3), are:

The = 2,1,1 | quick = 2,1,0 | fox = 1,1,0 | jumped = 1,1,0
over = 1,0,0 | dog = 1,0,1 | was = 0,0,1 | lazy = 0,0,1

Total words in the corpus: 8 + 4 + 4 = 16

Total term frequency (TTF) and document frequency (DF) for the unique words, as (TTF, DF) pairs, are:

The = 4, 3 | quick = 3, 2 | fox = 2, 2 | jumped = 2, 2
over = 1, 1 | dog = 2, 2 | was = 1, 1 | lazy = 1, 1

If we just follow a simple definition of inverse document frequency (IDF), IDF = log(total documents in corpus / DF), then the TTF-IDF weight (in practice often still called the TF-IDF weight) of each word becomes:

The = 4 * log(3/3) = 4 * 0 = 0
quick = 3 * log(3/2) = 3 * 0.18 = 0.54
fox = 2 * log(3/2) = 2 * 0.18 = 0.36
jumped = 2 * log(3/2) = 2 * 0.18 = 0.36
over = 1 * log(3/1) = 1 * 0.48 = 0.48
dog = 2 * log(3/2) = 2 * 0.18 = 0.36
was = 1 * log(3/1) = 1 * 0.48 = 0.48
lazy = 1 * log(3/1) = 1 * 0.48 = 0.48

So, the top 4 words should be: quick, over, was, lazy. While computing the tf-idf weight, you can give different weights to TF and IDF. Keep in mind that you are not selecting the top 4 words for each document but from the entire corpus. That's why total term frequency is used instead of term frequency. By the way, when you consider a whole corpus, the terms "term frequency" and "total term frequency" are used interchangeably.
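The TTF-IDF table above can be reproduced programmatically; a minimal sketch following this answer's definitions (log base 10; note that the three words tied at 0.48 may come out in a different order):

    import math
    from collections import Counter

    docs = [["the", "quick", "fox", "jumped", "over", "the", "quick", "dog"],
            ["the", "quick", "fox", "jumped"],
            ["the", "dog", "was", "lazy"]]

    n = len(docs)
    ttf = Counter(w for doc in docs for w in doc)       # total term frequency
    df = Counter(w for doc in docs for w in set(doc))   # document frequency

    ttf_idf = {w: ttf[w] * math.log10(n / df[w]) for w in ttf}
    top_4 = sorted(ttf_idf, key=ttf_idf.get, reverse=True)[:4]
    print(top_4)   # ['quick', 'over', 'was', 'lazy']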

  2. Is the sorting of the words correct?

The sorting is correct. Once you compute the tf-idf weight for each unique term (we call them dictionary terms in text mining), just sort them in descending order and pick the top k, i.e. the words with the highest tf-idf weight. If TF and IDF are not yet clear to you, I encourage you to read this Wikipedia article.

  3. Suppose the top 4 words are: lazy, quick, quick, fox. The above forms will change if the top 4 words were instead: lazy, quick, fox, dog. Is my understanding correct?

The answer to your question is yes: because your controlled vocabulary changes, your document representation changes as well. Once you select the top k words, assign each of them an index value. Then put a 1 if a particular word from the vocabulary appears in a document, otherwise 0. You can also use the term frequency instead of just putting 1.

Note that your data matrix is wrong: since you selected the top 4 words as the controlled vocabulary, the length of each document representation should be 4 as well. So, for example, if our controlled vocabulary is: quick, over, was, lazy, then the document representations should look like this:

doc1: 1 1 0 0 ['was', 'lazy' missing]
doc2: 1 0 0 0 ['over', 'was', 'lazy' missing]
doc3: 0 0 1 1 ['quick', 'over' missing]

You can generate the same using the term frequency: just put the term frequency (with respect to the individual document) instead of 1. For example, the representation for document 1 would look like: 2, 1, 0, 0 ['quick' appears twice].
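Both the binary and the term-frequency matrices can be built from the indexed vocabulary; a short sketch using the vocabulary from this answer:

    from collections import Counter

    vocab = ["quick", "over", "was", "lazy"]   # indices 0..3, as above

    docs = [["the", "quick", "fox", "jumped", "over", "the", "quick", "dog"],
            ["the", "quick", "fox", "jumped"],
            ["the", "dog", "was", "lazy"]]

    for doc in docs:
        tf = Counter(doc)
        binary = [1 if tf[w] else 0 for w in vocab]
        counts = [tf[w] for w in vocab]
        print(binary, counts)
    # doc1 -> [1, 1, 0, 0] [2, 1, 0, 0]
    # doc2 -> [1, 0, 0, 0] [1, 0, 0, 0]
    # doc3 -> [0, 0, 1, 1] [0, 0, 1, 1]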

Remember to follow a fixed ordering of the controlled-vocabulary terms. That's why I said to give an index number to each controlled-vocabulary term. In the examples I provided, I used: quick = 0, over = 1, was = 2, lazy = 3.

One more thing: the way you are representing documents is called the Bag-of-Words representation. It's very interesting, and I encourage you to read more about it.

Hopefully, my answer will help you.