
I am working on the Enron dataset to classify emails, using Python 3. I have pre-processed the data (tokenizing, removing stop words, stemming) and am currently working on representing it in transactional and data-matrix format. This is my understanding of the process:

  1. Find the tf-idf score of every word in every document.
  2. Sort the words by tf-idf score.
  3. Take the top "k" words by score.
  4. Iterate through the corpus and intersect the top "k" words with the words in every document; printing the top "k" words present in each document gives the data in transactional form.
  5. Recording the presence/absence (1/0) of each top "k" word in each document gives the data in data-matrix form.
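A minimal Python sketch of these five steps (the token lists and k = 4 are hypothetical placeholders for an already pre-processed corpus; log base 10 is used to match the calculations below, and deduplicating the ranked words is one judgment call for handling repeats like "quick"):

    import math
    from collections import Counter

    # Hypothetical tokenized corpus (stop words removed, stemmed)
    docs = [
        ["quick", "fox", "jump", "quick", "dog"],
        ["quick", "fox", "jump"],
        ["dog", "lazy"],
    ]

    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency

    # Step 1: tf-idf for every (word, document) pair
    tfidf = {}
    for i, doc in enumerate(docs):
        for w, f in Counter(doc).items():
            tfidf[(w, i)] = f * math.log10(n_docs / df[w])

    # Steps 2-3: sort by score, keep the top k distinct words
    k = 4
    ranked = sorted(tfidf.items(), key=lambda item: item[1], reverse=True)
    top_k = []
    for (w, _), _ in ranked:
        if w not in top_k:
            top_k.append(w)
        if len(top_k) == k:
            break

    # Step 4: transactional form - top-k words occurring in each document
    transactional = [[w for w in doc if w in top_k] for doc in docs]

    # Step 5: data-matrix form - presence/absence (1/0) of each top-k word
    matrix = [[1 if w in doc else 0 for w in top_k] for doc in docs]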

Consider the following 3 documents:

  • doc1: The quick fox jumped over the quick dog;
  • doc2: The quick fox jumped;
  • doc3: The dog was lazy;

tf-idf calculation:

tf("quick", doc1) = 2; 
tf("quick", doc2) = 1; 
idf("quick") = log(3/2) = 0.176; 
tfidf("quick", doc1) = 2*0.176 = 0.352; 
tfidf("quick", doc2) = 1*0.176 = 0.176; 

tf("lazy", doc3) = 1;
idf("lazy") = log(3/1) = 0.477;
tfidf("lazy", doc3) = 1*0.477 = 0.477;

tf("fox", doc1) = 1; 
tf("fox", doc2) = 1; 
idf("fox") = log(3/2) = 0.176; 
tfidf("fox", doc1) = 1*0.176 = 0.176; 
tfidf("fox", doc2) = 1*0.176 = 0.176; 

tf("dog", doc1) = 1; 
tf("dog", doc3) = 1; 
idf("dog") = log(3/2) = 0.176; 
tfidf("dog", doc1) = 1*0.176 = 0.176; 
tfidf("dog", doc3) = 1*0.176 = 0.176; 

So, if the above words were to be sorted, their rank would be as follows:

lazy (0.477), quick (0.352), quick (0.176), fox (0.176), fox (0.176), dog (0.176), dog (0.176).
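These values can be checked with a couple of lines of Python (base-10 log, matching the numbers above):

    import math

    idf_quick = math.log10(3 / 2)   # 0.176: "quick" appears in 2 of 3 docs
    print(2 * idf_quick)            # tfidf("quick", doc1) = 0.352
    print(1 * idf_quick)            # tfidf("quick", doc2) = 0.176
    print(1 * math.log10(3 / 1))    # tfidf("lazy", doc3)  = 0.477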

Questions:

  1. Based on the above calculation, what are the top 4 words? Are they for the overall corpus, or the top words in every document?
  2. Is the sorting of the words correct?
  3. Suppose the top 4 words are: lazy, quick, quick, fox. The transactional form is:
doc1: quick, fox, quick 
doc2: quick, fox
doc3: lazy

The data-matrix form is:
doc1: 1,1,0,0,1,0 (quick, fox, jump, over, quick, dog) 
doc2: 1,1,0 (quick, fox, jump) 
doc3: 0,1 (dog, lazy)

The above forms will change if the top 4 words were instead: lazy, quick, fox, dog. Is my understanding correct?


2 Answers


For your first question: since tf-idf is used to rank the relevance of documents to search strings, you would really be looking for the "top documents", i.e. the documents where your search words rank the highest overall. So you need to turn your calculation on its head and compute the rank of each document. After that, chances are you will not even need to worry about your second and third questions, because the documents' ranks will likely differ and you can just take the highest-ranked one as the answer. And, not to forget, you need a starting string against which to evaluate the documents.
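One simple way to realize this idea is to score each document by the summed tf-idf of the query terms; the sketch below (the function name and the summed-score heuristic are my own illustration, not something prescribed by this answer) returns document indices from highest to lowest score:

    import math
    from collections import Counter

    def rank_documents(query_terms, docs):
        """Rank tokenized documents by the summed tf-idf of the query terms."""
        n = len(docs)
        df = Counter(w for doc in docs for w in set(doc))
        scores = []
        for doc in docs:
            tf = Counter(doc)
            scores.append(sum(tf[w] * math.log10(n / df[w])
                              for w in query_terms if df[w]))
        return sorted(range(n), key=lambda i: scores[i], reverse=True)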

1
votes
  1. Based on the above calculation, what are the top 4 words? Are they for the overall corpus, or the top words in every document?

When you select the top k words, they become the controlled vocabulary (a text-mining term) for your corpus. I encourage you to go through this tutorial. A few important points:

  • When you select the top k words from the entire corpus, you are actually considering ttf-idf, where ttf means total term frequency. When you consider one single document and compute a term's frequency, we call it TF. When we do the same over the whole corpus, it becomes TTF.

For your example:

Unique words are: The, quick, fox, jumped, over, the, dog, was, lazy

I encourage you, before you pre-process your data, to convert it to either upper or lower case. Then "The" and "the" will be the same!

If you do that, then the unique words are: the, quick, fox, jumped, over, dog, was, lazy

Total unique words: 8

Term frequencies for each unique word, as (doc1, doc2, doc3), are:

The = 2,1,1 | quick = 2,1,0 | fox = 1,1,0 | jumped = 1,1,0
over = 1,0,0 | dog = 1,0,1 | was = 0,0,1 | lazy = 0,0,1

Total words in the corpus: 8 + 4 + 4 = 16

Total term frequency (TTF) and document frequency (DF) for the unique words, as (TTF, DF) pairs, are:

The = 4, 3 | quick = 3, 2 | fox = 2, 2 | jumped = 2, 2
over = 1, 1 | dog = 2, 2 | was = 1, 1 | lazy = 1, 1

If we just follow a simple definition of inverse document frequency (IDF), IDF = log(total documents in corpus / DF), then the TTF-IDF weight (in practice often still called the TF-IDF weight) of each word becomes:

The = 4 * log(3/3) = 4 * 0 = 0
quick = 3 * log(3/2) = 3 * 0.18 = 0.54
fox = 2 * log(3/2) = 2 * 0.18 = 0.36
jumped = 2 * log(3/2) = 2 * 0.18 = 0.36
over = 1 * log(3/1) = 1 * 0.48 = 0.48
dog = 2 * log(3/2) = 2 * 0.18 = 0.36
was = 1 * log(3/1) = 1 * 0.48 = 0.48
lazy = 1 * log(3/1) = 1 * 0.48 = 0.48

So, the top 4 words should be: quick, over, was, lazy. While computing the tf-idf weight, you can give different weights to TF and IDF. Keep in mind that you are not selecting the top 4 words for each document but from the entire corpus. That's why total term frequency is used instead of term frequency. By the way, when you consider a whole corpus, the terms "term frequency" and "total term frequency" are used interchangeably.
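The TTF-IDF table above can be reproduced programmatically; a minimal sketch following this answer's definitions (log base 10; note that the three words tied at 0.48 may come out in a different order):

    import math
    from collections import Counter

    docs = [["the", "quick", "fox", "jumped", "over", "the", "quick", "dog"],
            ["the", "quick", "fox", "jumped"],
            ["the", "dog", "was", "lazy"]]

    n = len(docs)
    ttf = Counter(w for doc in docs for w in doc)       # total term frequency
    df = Counter(w for doc in docs for w in set(doc))   # document frequency

    ttf_idf = {w: ttf[w] * math.log10(n / df[w]) for w in ttf}
    top_4 = sorted(ttf_idf, key=ttf_idf.get, reverse=True)[:4]
    print(top_4)   # ['quick', 'over', 'was', 'lazy']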

  2. Is the sorting of the words correct?

The sorting is correct. Once you compute the tf-idf weight for each unique term (we call them dictionary terms in text mining), just sort them in descending order and pick the top k, i.e. the words with the highest tf-idf weight. If TF and IDF are not yet clear to you, I encourage you to read this Wikipedia article.

  3. Suppose the top 4 words are: lazy, quick, quick, fox. The above forms will change if the top 4 words were instead: lazy, quick, fox, dog. Is my understanding correct?

The answer to your question is yes: because your controlled vocabulary changes, your document representation changes as well. Once you select the top k words, assign each of them an index value. Then put a 1 if a particular word from the vocabulary appears in a document, otherwise 0. You can also use the term frequency instead of just putting 1.

Note that your data matrix is wrong: since you selected the top 4 words as the controlled vocabulary, the length of each document representation should be 4 as well. So, for example, if our controlled vocabulary is: quick, over, was, lazy, then the document representations should look like this:

doc1: 1 1 0 0 ['was', 'lazy' missing]
doc2: 1 0 0 0 ['over', 'was', 'lazy' missing]
doc3: 0 0 1 1 ['quick', 'over' missing]

You can generate the same using the term frequency: just put the term frequency (with respect to the individual document) instead of 1. For example, the representation for document 1 would look like: 2, 1, 0, 0 ['quick' appears twice].
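Both the binary and the term-frequency matrices can be built from the indexed vocabulary; a short sketch using the vocabulary from this answer:

    from collections import Counter

    vocab = ["quick", "over", "was", "lazy"]   # indices 0..3, as above

    docs = [["the", "quick", "fox", "jumped", "over", "the", "quick", "dog"],
            ["the", "quick", "fox", "jumped"],
            ["the", "dog", "was", "lazy"]]

    for doc in docs:
        tf = Counter(doc)
        binary = [1 if tf[w] else 0 for w in vocab]
        counts = [tf[w] for w in vocab]
        print(binary, counts)
    # doc1 -> [1, 1, 0, 0] [2, 1, 0, 0]
    # doc2 -> [1, 0, 0, 0] [1, 0, 0, 0]
    # doc3 -> [0, 0, 1, 1] [0, 0, 1, 1]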

Remember to follow a fixed ordering of the controlled-vocabulary terms. That's why I said to give an index number to each controlled-vocabulary term. In the examples I provided, I used: quick = 0, over = 1, was = 2, lazy = 3.

One more thing: the way you are representing documents is called the Bag-of-Words representation. It's very interesting, and I encourage you to read more about it.

Hopefully, my answer will help you.