19
votes

First, let's extract the TF-IDF scores per term per document:

from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

Printing it out:

for doc in corpus_tfidf:
    print(doc)

[out]:

[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
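As a quick sanity check on the scores above: gensim's TfidfModel L2-normalizes each document vector by default, so the squared scores within any one document sum to approximately 1. Using the first document's printed scores:

```python
# gensim's TfidfModel L2-normalizes each document vector by default,
# so the squared TF-IDF scores of a single document sum to ~1.
doc0_scores = [0.4301019571350565] * 4 + [0.2944198962221451] * 3
print(sum(s * s for s in doc0_scores))  # ≈ 1.0
```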

If we want to find the "saliency" or "importance" of the words within this corpus, can we simply sum the TF-IDF scores across all documents and divide by the number of documents? I.e.:

>>> from collections import Counter
>>> tfidf_saliency = Counter()
>>> for doc in corpus_tfidf:
...     for word, score in doc:
...         tfidf_saliency[word] += score / len(corpus_tfidf)
... 
>>> tfidf_saliency
Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})

Looking at the output, could we assume that the most "prominent" words in the corpus are:

>>> dictionary[7]
u'system'
>>> dictionary[8]
u'survey'
>>> dictionary[26]
u'graph'

If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?

5
Could you please append the output of your dictionary to your question? I want to compare it with my dictionary so I can update the output table in my answer. – stovfl
Whoops, sorry, I didn't save it. The dictionary would be different because I was using Python 3, and the dictionary isn't the same if I re-run it. But the ranking of the words should be deterministic since it's based on static counts; re-run the gensim code and you should get the same "system, survey, graph" as the top 3. – alvas
Sorry, could not use gensim. – stovfl

5 Answers

6
votes

One interpretation of TF-IDF at the corpus level is to take, for each term, the highest TF-IDF score it reaches in any document of the corpus.

Find the Top Words in corpus_tfidf.

    topWords = {}
    for doc in corpus_tfidf:
        for iWord, tf_idf in doc:
            if iWord not in topWords:
                topWords[iWord] = 0

            if tf_idf > topWords[iWord]:
                topWords[iWord] = tf_idf

    for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
        print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
        if i == 6: break

Output comparison chart:
NOTE: Couldn't use gensim to create a matching dictionary with corpus_tfidf,
so only word indices can be displayed.

Question tfidf_saliency   topWords(corpus_tfidf)  Other TF-IDF implementation  
---------------------------------------------------------------------------  
1: Word(7)   0.121        1: Word(13)    0.640    1: paths         0.376019  
2: Word(8)   0.111        2: Word(27)    0.632    2: intersection  0.376019  
3: Word(26)  0.108        3: Word(28)    0.632    3: survey        0.366204  
4: Word(29)  0.100        4: Word(8)     0.628    4: minors        0.366204  
5: Word(9)   0.090        5: Word(29)    0.628    5: binary        0.300815  
6: Word(14)  0.087        6: Word(11)    0.544    6: generation    0.300815  

The calculation of TF-IDF always takes the whole corpus into account.

Tested with Python 3.4.2

2
votes

There are two contexts in which saliency can be calculated:

  1. saliency in the corpus
  2. saliency in a single document

Saliency in the corpus can be calculated simply by counting the appearances of a particular word in the corpus, or by the inverse of the count of documents the word appears in (IDF = Inverse Document Frequency), because words that carry a specific meaning do not appear everywhere.

Saliency in a document is calculated by TF-IDF, because TF-IDF is composed of two kinds of information: global information (corpus-based) and local information (document-based). Claiming that "the word with the larger in-document frequency is more important in the current document" is not completely true or false, because it depends on the global saliency of the word. In a particular document you have many words like "it, is, am, are, ..." with large frequencies, but these words are not important in any document, and you can treat them as stop words!
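The corpus-level notion above can be sketched in a few lines of plain Python (a minimal sketch using three documents from the question's toy corpus; stopword removal is omitted for brevity, and the log base is an arbitrary choice):

```python
import math

# Three documents from the question's toy corpus.
documents = [
    "human machine interface for lab abc computer applications",
    "a survey of user opinion of computer system response time",
    "the eps user interface management system",
]
texts = [doc.lower().split() for doc in documents]
n_docs = len(texts)

# Document frequency: in how many documents does each word appear?
df = {}
for text in texts:
    for word in set(text):
        df[word] = df.get(word, 0) + 1

# Corpus-level saliency via IDF: rarer words score higher.
idf = {word: math.log(n_docs / freq, 2) for word, freq in df.items()}
print(sorted(idf.items(), key=lambda kv: -kv[1])[:3])
```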

---- edit ---

The denominator (= len(corpus_tfidf)) is a constant value and can be ignored if you care about the ordering rather than the magnitude of the measurement. On the other hand, IDF means Inverse Document Frequency, so IDF can be represented by 1/DF. DF is a corpus-level value and TF is a document-level value. Summing TF-IDF turns document-level TF into corpus-level TF. Indeed, the summation is equal to this formula:

count(word) / count(documents containing word)

This measurement can be called an inverse-scattering value. When the value goes up, it means the word is gathered into a smaller subset of documents, and vice versa.
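That ratio can be sketched directly (a toy example with made-up documents; the equality to summed TF-IDF only holds under the un-normalized TF-IDF assumed in this edit):

```python
# "Inverse-scattering" value: total occurrences of a word in the corpus
# divided by the number of documents that contain it.
documents = [
    "graph minors a survey".split(),
    "the intersection graph of paths in trees".split(),
    "system and human system engineering testing of eps".split(),
]

def inverse_scattering(word, docs):
    total = sum(doc.count(word) for doc in docs)
    n_containing = sum(1 for doc in docs if word in doc)
    return total / n_containing if n_containing else 0.0

# "system": 2 occurrences, all in one document -> gathered (high value).
print(inverse_scattering("system", documents))  # 2.0
# "graph": 2 occurrences spread over two documents -> scattered (lower value).
print(inverse_scattering("graph", documents))   # 1.0
```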

I believe that this formula is not so useful.

2
votes

This is a great discussion. Thanks for starting this thread. The idea of including document length, suggested by @avip, seems interesting. We will have to experiment and check the results. In the meantime, let me try asking the question a little differently: what are we trying to interpret when querying for TF-IDF relevance scores?

  1. Possibly trying to understand the word relevance at the document level
  2. Possibly trying to understand the word relevance per Class
  3. Possibly trying to understand the word relevance overall ( in the whole corpus )

     # 3 features, corpus = 6 documents
     import numpy as np
     from sklearn.feature_extraction.text import TfidfTransformer

     counts = [[3, 0, 1],
               [2, 0, 0],
               [3, 0, 0],
               [4, 0, 0],
               [3, 2, 0],
               [3, 0, 2]]
     transformer = TfidfTransformer(smooth_idf=False)
     tfidf = transformer.fit_transform(counts)
     print(tfidf.toarray())
    
     # lambda for basic stat computation
     summarizer_default = lambda x: np.sum(x, axis=0)
     summarizer_mean = lambda x: np.mean(x, axis=0)
    
     print(summarizer_default(tfidf))
     print(summarizer_mean(tfidf))
    

Result:

# Result post computing TF-IDF relevance scores
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

# Result post aggregation (Sum, Mean) 
[[ 4.87420595  0.88089948  1.38675962]]
[[ 0.81236766  0.14681658  0.2311266 ]]

If we observe closely, we realize that feature1, which occurred in all the documents, is not ignored completely, because the sklearn implementation of idf is idf = log[ n / df(d, t) ] + 1. The +1 is added so that an important word which just so happens to occur in all documents is not ignored; e.g. the word 'bike' occurring very frequently when classifying a particular document as 'motorcycle' (20_newsgroup dataset).
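That formula can be checked by hand with plain NumPy (a sketch reproducing sklearn's non-smoothed idf and its default L2 row normalization, without importing sklearn):

```python
import numpy as np

# Reproduce sklearn's non-smoothed idf: idf(t) = ln(n / df(t)) + 1.
# The "+ 1" keeps a word that occurs in every document from being zeroed out.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
df = np.count_nonzero(counts, axis=0)   # document frequency per feature: [6, 1, 2]
idf = np.log(n_docs / df) + 1           # [1.0, ln(6) + 1, ln(3) + 1]

tfidf = counts * idf                    # raw tf-idf
# L2-normalize each row, as TfidfTransformer does by default (norm='l2').
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.round(tfidf, 8))               # matches the array shown above
```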

  1. Now, in regard to the first 2 questions, one is trying to interpret and understand the top common features that might be occurring in the documents. In that case, aggregating in some form, including all possible occurrences of a word in a doc, is not taking anything away, even mathematically. IMO such a query is very useful for exploring the dataset and helping to understand what the dataset is about. The logic might be applied to vectorizing using hashing as well.

    relevance_score = mean(tf(t,d) * idf(t,d)) = mean((bias + initial_wt * F(t,d) / max{F(t',d)}) * (log(N / df(d,t)) + 1))

  2. Question 3 is very important, as it might as well be contributing to features being selected for building a predictive model. Using TF-IDF scores independently for feature selection might be misleading at multiple levels. Adopting a more theoretical statistical test such as chi2, coupled with TF-IDF relevance scores, might be a better approach. Such a statistical test also evaluates the importance of a feature in relation to the respective target class.
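For intuition, here is a minimal sketch of the chi-squared statistic for a single term against a binary class, on a hypothetical 2x2 contingency table (the counts are made up for illustration; in practice you would use sklearn's chi2 with SelectKBest):

```python
# Chi-squared test for one feature against a binary class, the kind of
# signal SelectKBest(chi2) uses alongside tf-idf.
# Contingency table (hypothetical counts): rows = word present/absent,
# columns = class A / class B.
observed = [[30, 10],   # word present
            [20, 40]]   # word absent

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 4))  # 16.6667 -> the word is strongly class-dependent
```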

And of course, combining such interpretation with the model's learned feature weights would be very helpful in understanding the importance of text-derived features completely.

The problem is a little more complex to cover in detail here, but hoping the above helps. What do others feel?

Reference: https://arxiv.org/abs/1707.05261

0
votes

I stumbled across the same problem somehow. I will share my solution here, but I don't really know how effective it is.

After calculating TF-IDF, what we have is essentially a matrix of terms vs. documents.

[terms/docs : doc1,           doc2,           ...  docn
 term1      : tfidf(t1, d1),  tfidf(t1, d2),  ...
 .
 .
 termn      : tfidf(tn, d1),  tfidf(tn, d2),  ... ]

We can think of the columns doc1, doc2, ..., docn as scores given to every term according to n different metrics. If we sum across the columns, we are simply averaging the scores, which is naive and does not completely represent the information captured. We can do something better, since this is a top-k retrieval problem. One efficient algorithm is Fagin's algorithm, which works on this idea:

The sorted lists are scanned until k data items are found which have been seen in all the lists, then the algorithm can stop and it is guaranteed that among all the data items seen so far, even those which were not present in all the lists, the top-k data items can be found.

Sorted lists here simply means that a single column for a particular doc becomes a list, and we have n such lists. So sort each one of them and then run Fagin's algorithm on them.
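A rough sketch of Fagin's algorithm under simplifying assumptions (every term appears in every list, and the aggregation function is a plain sum; the term lists below are hypothetical):

```python
def fagin_top_k(sorted_lists, k):
    """sorted_lists: one list of (item, score) per document/metric, each
    sorted by score descending. Assumes every item appears in every list."""
    lookups = [dict(lst) for lst in sorted_lists]  # "random access" tables
    seen = {}   # item -> number of lists it has been seen in so far
    depth = 0
    # Sorted access, round-robin, until k items were seen in ALL lists.
    while depth < min(len(lst) for lst in sorted_lists):
        for lst in sorted_lists:
            item, _ = lst[depth]
            seen[item] = seen.get(item, 0) + 1
        depth += 1
        if sum(1 for c in seen.values() if c == len(sorted_lists)) >= k:
            break
    # Random access: aggregate every item seen so far, return the top k.
    totals = {item: sum(lk[item] for lk in lookups) for item in seen}
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k]

# Hypothetical per-document term scores (already sorted descending):
doc1 = [("system", 0.9), ("graph", 0.5), ("user", 0.4), ("tree", 0.1)]
doc2 = [("graph", 0.8), ("system", 0.7), ("tree", 0.3), ("user", 0.2)]
print(fagin_top_k([doc1, doc2], k=2))
```

The guarantee is that once k items have been fully seen through sorted access, no unseen item can outrank them, so the scan can stop early instead of aggregating every column.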

Read about it more here

-1
votes

If we want to find the "saliency" or "importance" of the words within this corpus, can we simply sum the TF-IDF scores across all documents and divide by the number of documents? If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?

If you summed TF-IDF scores across documents, terms that would otherwise have low scores might get a boost, and terms with higher scores might have their scores depressed.

I don't think simply dividing by the total number of documents will be sufficient normalization to address this. Maybe incorporating document length into the normalization factor would help? Either way, I think all such methods would still need to be adjusted per domain.

So, generally speaking, mathematically I expect you would get an undesirable averaging effect.