0 votes
library(text2vec)
library(parallel)
library(doParallel)

N <- parallel::detectCores()
cl <- makeCluster(N)
registerDoParallel(cl)
Ky_young <- read.csv("./Ky_young.csv")

IT <- itoken_parallel(Ky_young$TEXTInfo,
                      ids         = Ky_young$ID,
                      tokenizer   = word_tokenizer,
                      progressbar = FALSE)

## stopwords
stop_words <- readLines("./stopwrd1.txt", encoding = "UTF-8")

VOCAB <- create_vocabulary(
        IT, stopwords = stop_words,
        ngram = c(1, 1)) %>%
        prune_vocabulary(term_count_min = 5)


VOCAB.order <- VOCAB[order(VOCAB$term_count, decreasing = TRUE), ]

VECTORIZER <- vocab_vectorizer(VOCAB)

DTM <- create_dtm(IT, VECTORIZER)


LDA_MODEL <-
      LatentDirichletAllocation$new(n_topics         = 200,
                                    doc_topic_prior  = 0.1,
                                    topic_word_prior = 0.01)


## document-topic distribution
LDA_FIT <- LDA_MODEL$fit_transform(
        x = DTM,
        n_iter = 50,
        convergence_tol = -1,
        n_check_convergence = 10)

## topic-word distribution (renamed so it does not shadow the topic_word_prior argument)
topic_word_distr <- LDA_MODEL$topic_word_distribution
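For inspecting what a fitted model learned, text2vec's LDA class also exposes a `get_top_words()` method. The sketch below is a self-contained toy version of the pipeline above (the four-document corpus is made up for illustration); it shows that `fit_transform()` returns one row per document, each row a distribution over topics, and how to list the highest-probability words per topic.

```r
library(text2vec)

# Toy corpus standing in for Ky_young$TEXTInfo (hypothetical data)
docs <- c("apple banana apple fruit", "dog cat dog animal",
          "apple fruit banana", "cat animal dog")
it  <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)
v   <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))

lda <- LDA$new(n_topics = 2, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic <- lda$fit_transform(dtm, n_iter = 50, progressbar = FALSE)

# Rows are documents, columns are topics; each row sums to ~1
print(rowSums(doc_topic))

# Highest-probability words for each topic
print(lda$get_top_words(n = 3, topic_number = 1:2, lambda = 1))
```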

I created this test LDA code with text2vec, and I can get the word-topic distribution and the document-topic distribution (and it was crazy fast).

By the way, is it possible to get the topic distribution for each token in a document from text2vec's LDA model?

I understand that in LDA each token in a document is assigned to a specific topic, and that is why each document has a topic distribution.

If I can get each token's topic distribution, I would like to check how each topic's top words change across classified documents (e.g. by period). Is that possible?

If there is another way to do this, I would be very grateful to hear it.

Topic-word assignments are in LDA_MODEL$components. Is that what you are looking for? – Dmitriy Selivanov

If I can match the LDA_MODEL$components result with the raw document set, I can find out each token's topic within a document. I saw that field when I tested your package, but I failed to match it with the raw documents. For example, I tried to see the words belonging to the first 100 documents in LDA_MODEL$components. Is that possible? – 유승환

Not sure I understand what you are trying to achieve. Could you provide an example (update the question)? Not code, just describe your use case. – Dmitriy Selivanov

As I understand it, the distribution of topics comes from the terms in the document being assigned to particular topics, so the distribution over all topics is the sum of the terms assigned to each topic. (Is that correct?) – 유승환

And the LDA model created by the topic modeling analysis covers the entire text that was used for the analysis. Suppose it is diary text: I split the diary data by year and record the year in the document title. I want to see the topic distribution by period, but I also want to see the changes in the terms that make up the topics. – 유승환

1 Answer

1 vote

Unfortunately it is impossible to get the distribution of topics for each token in a given document. Document-topic counts are calculated/aggregated "on the fly" during sampling, so the per-token topic assignments are not stored anywhere.
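For the use case described in the comments (tracking topics across periods of a diary corpus), one workaround that does not need per-token assignments: fit a single model on the whole corpus, then call the model's `$transform()` on a DTM built from each period's subset and compare the average topic proportions. This is only a sketch under assumed data (the toy documents and period labels are hypothetical), not the package author's prescribed method.

```r
library(text2vec)

# Toy data: documents tagged with a hypothetical period label
docs   <- c("apple banana apple", "dog cat dog", "banana apple", "cat dog cat")
period <- c("2010", "2010", "2011", "2011")

it  <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)
v   <- create_vocabulary(it)
vec <- vocab_vectorizer(v)
dtm <- create_dtm(it, vec)

lda <- LDA$new(n_topics = 2, doc_topic_prior = 0.1, topic_word_prior = 0.01)
invisible(lda$fit_transform(dtm, n_iter = 50, progressbar = FALSE))

# Average topic proportions per period: build a DTM for each period's
# documents with the same vectorizer, infer with the fitted model,
# then average the per-document rows.
by_period <- lapply(split(docs, period), function(d) {
  it_p  <- itoken(d, tokenizer = word_tokenizer, progressbar = FALSE)
  dtm_p <- create_dtm(it_p, vec)
  colMeans(lda$transform(dtm_p, n_iter = 50, progressbar = FALSE))
})
print(by_period)
```

Because the same vectorizer (and therefore the same vocabulary and topics) is reused for every period, the per-period averages are directly comparable.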