I have a corpus of 250k Dutch news articles (2010-2020) to which I've applied word2vec models to uncover relationships between sets of neutral words and semantic dimensions (e.g. good-bad). Since my aim is also to analyze the prevalence of certain topics over time, I was thinking of using doc2vec instead, so as to simultaneously learn word and document embeddings. The 'prevalence' of a topic in a document could then be calculated as the cosine similarity between the doc vector and word embeddings (or combinations of word vectors). In this way, I can calculate the annual topical prevalence in the corpus and see whether there are any changes over time. An example of such an approach can be found here.
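In other words, for a single document the score I have in mind looks roughly like this (topic_score is purely illustrative, not part of my actual pipeline):

import numpy as np

# illustrative only: cosine similarity between one document vector
# and the centroid of the keyword vectors defining a topic
def topic_score(doc_vec, keyword_vecs):
    centroid = np.mean(keyword_vecs, axis=0)
    return doc_vec @ centroid / (np.linalg.norm(doc_vec) * np.linalg.norm(centroid))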
My issue is that the avg. yearly cosine similarities yield really strange results. As an example, the cosine similarities between document vectors and a set of keywords related to COVID-19/coronavirus show a decrease in topical prevalence since 2016, which obviously cannot be the case.
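A quick way to sanity-check that trend is to compare it against raw keyword frequency per year, along these lines (this assumes the raw texts live in a 'text' column of df, which may not match your setup):

# sanity check: share of articles per year that literally mention any keyword
pattern = '|'.join(covid)
hits = df['text'].str.contains(pattern, case=False, na=False)
print(hits.groupby(df['datetime'].dt.year).mean())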
My question is whether the approach I'm following is actually valid, or whether there's something I'm missing.
The code that I've written:
# reload doc and word vectors
import pandas as pd
from gensim.models import KeyedVectors
from numpy.linalg import norm

wordvecs = KeyedVectors.load('/content/drive/MyDrive/doc2_wv.kv')
docvecs = KeyedVectors.load('/content/drive/MyDrive/doc2_docvecs.kv')

# normalize a vector to unit length
def nrm(x):
    return x / norm(x)

# topical prevalence per doc: average cosine similarity between
# each document vector and every keyword vector in the topic
def topicalprevalence(topic, docvecs, wordvecs):
    proj_lst = []
    for i in range(len(docvecs)):
        topic_lst = []
        for j in topic:
            cossim = nrm(docvecs[i]) @ nrm(wordvecs[j])
            topic_lst.append(cossim)
        topic_avg = sum(topic_lst) / len(topic_lst)
        proj_lst.append(topic_avg)
    topicsyrs = {
        'topic': proj_lst,
        'year': df['datetime'].dt.year  # df holds the article metadata, assumed in scope
    }
    return pd.DataFrame(topicsyrs)

# avg topic prevalence per year
def avgtopicyear(topic, docvecs, wordvecs):
    docs = topicalprevalence(topic, docvecs, wordvecs)
    return pd.DataFrame(docs.groupby("year")["topic"].mean())

# run
covid = ['corona', 'coronapandemie', 'coronacrisis', 'covid', 'pandemie']
covid_scores = topicalprevalence(covid, docvecs, wordvecs)
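The yearly averages mentioned above then come from avgtopicyear, which I inspect like this (the .plot() call assumes matplotlib is available):

covid_yearly = avgtopicyear(covid, docvecs, wordvecs)
print(covid_yearly)   # mean topic score per year
covid_yearly.plot()   # quick visual check of the trend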