0 votes
library(text2vec)
library(parallel)
library(doParallel)

N <- parallel::detectCores()
cl <- makeCluster(N)
registerDoParallel(cl)
Ky_young <- read.csv("./Ky_young.csv")

IT <- itoken_parallel(Ky_young$TEXTInfo,
                      ids         = Ky_young$ID,
                      tokenizer   = word_tokenizer,
                      progressbar = FALSE)

## stopwords
stop_words <- readLines("./stopwrd1.txt", encoding = "UTF-8")

VOCAB <- create_vocabulary(
        IT, stopwords = stop_words,
        ngram = c(1, 1)) %>%
        prune_vocabulary(term_count_min = 5)


VOCAB.order <- VOCAB[order(VOCAB$term_count, decreasing = TRUE), ]

VECTORIZER <- vocab_vectorizer(VOCAB)

DTM <- create_dtm(IT, VECTORIZER)


LDA_MODEL <-
      LatentDirichletAllocation$new(n_topics         = 200,
                                    doc_topic_prior  = 0.1,
                                    topic_word_prior = 0.01)


## document-topic distribution
LDA_FIT <- LDA_MODEL$fit_transform(
        x = DTM,
        n_iter = 50,
        convergence_tol = -1,
        n_check_convergence = 10)

## topic-word distribution (renamed so it does not shadow the topic_word_prior argument)
topic_word_distr <- LDA_MODEL$topic_word_distribution
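For inspecting what a fitted model learned, text2vec's LDA class also exposes a `get_top_words()` method. The sketch below is a self-contained toy version of the pipeline above (the four-document corpus is made up for illustration); it shows that `fit_transform()` returns one row per document, each row a distribution over topics, and how to list the highest-probability words per topic.

```r
library(text2vec)

# Toy corpus standing in for Ky_young$TEXTInfo (hypothetical data)
docs <- c("apple banana apple fruit", "dog cat dog animal",
          "apple fruit banana", "cat animal dog")
it  <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)
v   <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))

lda <- LDA$new(n_topics = 2, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic <- lda$fit_transform(dtm, n_iter = 50, progressbar = FALSE)

# Rows are documents, columns are topics; each row sums to ~1
print(rowSums(doc_topic))

# Highest-probability words for each topic
print(lda$get_top_words(n = 3, topic_number = 1:2, lambda = 1))
```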

I created this test LDA code with text2vec, and I can get the word-topic distribution and the document-topic distribution (and it was crazy fast).

By the way, is it possible to get the topic distribution for each token in a document from text2vec's LDA model?

I understand that in LDA each token in a document is assigned to a specific topic, and that is why each document has a topic distribution.

If I can get each token's topic distribution, I would like to check how each topic's top words change across classified documents (e.g. by period). Is that possible?

If there is another way to do this, I would be very grateful to hear it.

Topic-word assignments are in LDA_MODEL$components. Is that what you are looking for? – Dmitriy Selivanov

If I can match the LDA_MODEL$components result with the raw document set, I can find out each token's topic within a document. I saw that field when I tested your package, but I failed to match it with the raw documents. For example, I tried to see the words belonging to the first 100 documents in LDA_MODEL$components. Is that possible? – 유승환

Not sure I understand what you are trying to achieve. Could you provide an example (update the question)? Not code, just describe your use case. – Dmitriy Selivanov

As I understand it, the distribution of topics comes from the terms in the document being assigned to particular topics, so the distribution over all topics is the sum of the terms assigned to each topic. (Is that correct?) – 유승환

And the LDA model created by the topic modeling analysis covers the entire text that was used for the analysis. Suppose it is diary text: I split the diary data by year and record the year in the document title. I want to see the topic distribution by period, but I also want to see the changes in the terms that make up the topics. – 유승환

1 Answer

1 vote

Unfortunately it is impossible to get the distribution of topics for each token in a given document. Document-topic counts are calculated/aggregated "on the fly" during sampling, so the per-token topic assignments are not stored anywhere.
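For the use case described in the comments (tracking topics across periods of a diary corpus), one workaround that does not need per-token assignments: fit a single model on the whole corpus, then call the model's `$transform()` on a DTM built from each period's subset and compare the average topic proportions. This is only a sketch under assumed data (the toy documents and period labels are hypothetical), not the package author's prescribed method.

```r
library(text2vec)

# Toy data: documents tagged with a hypothetical period label
docs   <- c("apple banana apple", "dog cat dog", "banana apple", "cat dog cat")
period <- c("2010", "2010", "2011", "2011")

it  <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)
v   <- create_vocabulary(it)
vec <- vocab_vectorizer(v)
dtm <- create_dtm(it, vec)

lda <- LDA$new(n_topics = 2, doc_topic_prior = 0.1, topic_word_prior = 0.01)
invisible(lda$fit_transform(dtm, n_iter = 50, progressbar = FALSE))

# Average topic proportions per period: build a DTM for each period's
# documents with the same vectorizer, infer with the fitted model,
# then average the per-document rows.
by_period <- lapply(split(docs, period), function(d) {
  it_p  <- itoken(d, tokenizer = word_tokenizer, progressbar = FALSE)
  dtm_p <- create_dtm(it_p, vec)
  colMeans(lda$transform(dtm_p, n_iter = 50, progressbar = FALSE))
})
print(by_period)
```

Because the same vectorizer (and therefore the same vocabulary and topics) is reused for every period, the per-period averages are directly comparable.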