I am trying to get the average TF-IDF value of a word across an entire corpus. Suppose the word 'stack' appears 4 times in our corpus (a couple of hundred documents), with the values 0.34, 0.45, 0.68 and 0.78 in the 4 documents it was found in. Its average TF-IDF value across the entire corpus is therefore 0.5625. How can I find this for all the words in the corpus?
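To make the averaging concrete, here is a minimal sketch of the calculation I have in mind (the list of per-document scores is just the made-up example above):

import numpy as np

# Hypothetical TF-IDF scores for 'stack', one per document it appears in
stack_scores = [0.34, 0.45, 0.68, 0.78]

# Average only over the documents that actually contain the word
print(np.mean(stack_scores))  # 0.5625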
I am using a scikit-learn implementation of TF-IDF. This is the code I am using to get the TF-IDF values for each document:
feature_names = cv.get_feature_names()
for doc in docs_test:
    # TF-IDF vector for this document
    tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
    sorted_items = sort_coo(tf_idf_vector.tocoo())
    # Extract the top 81 keywords along with their TF-IDF scores
    keywords = extract_topn_from_vector(feature_names, sorted_items, 81)
For each iteration, this outputs a dictionary of 81 words and their TF-IDF score for that document:
{'kerry': 0.396, 'paris': 0.278, 'france': 0.252 ......}
Since I am only outputting the top 81 words per document, I know that not every word in the corpus will be covered. So I want the average TF-IDF value of each of these top-81 words (the same words will be repeated across documents), roughly as sketched below.
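To illustrate the aggregation I am after, here is a rough sketch built on top of the loop above, reusing my own helpers (sort_coo, extract_topn_from_vector); the names score_sums, score_counts and avg_tfidf are just illustrative:

from collections import defaultdict

score_sums = defaultdict(float)   # running sum of TF-IDF scores per word
score_counts = defaultdict(int)   # number of documents the word showed up in

feature_names = cv.get_feature_names()
for doc in docs_test:
    tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
    sorted_items = sort_coo(tf_idf_vector.tocoo())
    keywords = extract_topn_from_vector(feature_names, sorted_items, 81)
    for word, score in keywords.items():
        score_sums[word] += score
        score_counts[word] += 1

# One value per word: its average TF-IDF over the documents where it was a top keyword
avg_tfidf = {w: score_sums[w] / score_counts[w] for w in score_sums}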
EDIT: I tried out @mujjiga's solution. Here are the results:
{'the': 0.51203095036175, 'to': 0.36268858983957286, 'of': 0.3200193439760937, 'in': 0.256015475180875, 'he': 0.2133462293173958}
{'the': 0.5076730825668095, 'to': 0.3299875036684262, 'in': 0.3299875036684262, 'and': 0.30460384954008574, 'trump': 0.17768557889838335}
{'the': 0.5257856140532874, 'children': 0.292103118918493, 'to': 0.2336824951347944, 'winton': 0.2336824951347944, 'of': 0.2336824951347944}
{'the': 0.6082672845890075, 'to': 0.3146210092701763, 'trump': 0.2936462753188312, 'that': 0.23911196704533397, 'of': 0.21394228630371986}
{'the': 0.6285692218670833, 'to': 0.3610929572427925, 'of': 0.2139810116994326, 'that': 0.20060719846821806, 'iran': 0.18723338523700353}
{'the': 0.5730922466510651, 'clinton': 0.29578954665861423, 'of': 0.24032900666012408, 'in': 0.2218421599939607, 'that': 0.2218421599939607}
{'the': 0.7509270472649924, 'to': 0.34926839407674065, 'trump': 0.17463419703837033, 'of': 0.17463419703837033, 'delegates': 0.1571707773345333}
{'on': 0.4, 'administration': 0.2, 'through': 0.2, 'the': 0.2, 'tax': 0.2}
{'the': 0.5885277950982455, 'in': 0.3184973949943446, 'of': 0.3046496821685035, 'to': 0.29080196934266245, 'women': 0.2769542565168214}
As we can see, the word 'the' gets a different value in each document. I apologise if my question did not make this clear, but I want a single value per word: its average TF-IDF score across that corpus of documents. Any help on getting this to work? Thanks!
Here is the code used:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

itr = 0
for i in range(1, 10):
    docs = docs_test[itr]
    docs = [docs]
    itr += 1
    tfidf_vectorizer = TfidfVectorizer(use_idf=True)
    tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)
    tfidf = tfidf_vectorizer_vectors.todense()
    # TF-IDF of words not in the doc will be 0, so replace them with nan
    tfidf[tfidf == 0] = np.nan
    # Use numpy's nanmean, which ignores nan while calculating the mean
    means = np.nanmean(tfidf, axis=0)
    # Convert it into a dictionary for later lookup
    means = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))
    tfidf = tfidf_vectorizer_vectors.todense()
    # Argsort the full TF-IDF dense matrix (descending)
    ordered = np.argsort(tfidf * -1)
    words = tfidf_vectorizer.get_feature_names()
    top_k = 5
    for i, doc in enumerate(docs):
        result = {}
        # Pick the top_k entries from the argsorted row for each doc
        for t in range(top_k):
            # Look up the word's average TF-IDF from the precomputed
            # nanmean dictionary and save it for later use
            result[words[ordered[i, t]]] = means[words[ordered[i, t]]]
        print(result)
Regarding docs=docs_test[itr]: there should be no loop; docs should contain all the documents in the corpus. Something like docs=docs_test. – mujjiga
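Following that comment, I think the loop-free version would look roughly like this (a sketch, assuming docs_test is a plain list of the corpus's document strings; avg_tfidf is just an illustrative name, and get_feature_names matches the version of scikit-learn used above):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Fit the vectorizer once on the whole corpus instead of looping per document
docs = docs_test
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf = tfidf_vectorizer.fit_transform(docs).todense()

# TF-IDF of words not in a document is 0; replace with nan so those
# documents do not drag the average down
tfidf[tfidf == 0] = np.nan

# Average each word (column) over only the documents it appears in
means = np.nanmean(tfidf, axis=0)

# One value per word across the entire corpus
avg_tfidf = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))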