
I'm trying to classify IT support tickets into relevant topics using LDA in R.

My corpus has 5,550 documents and 1,882 terms. I started with 12,000 terms, but after removing common stop words and other noise words I was left with roughly 1,800.

Upon examination of the LDAvis output, the topics returned by the algorithm are pretty good, which I've verified by checking a sample of the corpus. The top words are exclusive to their topics, and one can identify the topic on a first reading.

But on checking the document-topic probability matrix, the probability assigned is very low in the majority of cases (ideally it should be high, since the topics we're getting are good).

I've already tried the following: different numbers of topics and more iterations, but nothing has helped so far.

If I increase the number of terms in the corpus (i.e., keep some of the words I removed), I end up with a bad representation of topics.

My code and the LDA parameters are:

library(topicmodels)  # provides LDA() with Gibbs sampling

burnin <- 4000
iter   <- 2000
thin   <- 500
seed   <- list(2003, 5, 63, 100001, 765)
nstart <- 5
best   <- TRUE
keep   <- 0   # was undefined; set > 0 to record the log-likelihood every 'keep' iterations
k      <- 29  # established via the log-likelihood

ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed,
                             best = best, burnin = burnin,
                             iter = iter, thin = thin, keep = keep))
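To see concretely how flat the document-topic distributions are, the fitted model can be inspected directly. This is a sketch, assuming `ldaOut` was fitted as above with the `topicmodels` package:

```r
# Document-topic (gamma) matrix: one row per document, rows sum to 1
topic_probs <- posterior(ldaOut)$topics   # 5522 x 29 matrix

# How confident is the model about each document's dominant topic?
top_prob <- apply(topic_probs, 1, max)
summary(top_prob)   # a low median here indicates flat distributions

# Most likely topic per document, for spot-checking against the raw tickets
top_topic <- topics(ldaOut)
head(top_topic)
```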

The str() of the LDA output is:

..@ seedwords      : NULL
..@ z              : int [1:111776] 12 29 3 27 11 12 14 12 12 24 ...
..@ alpha          : num 1.72
..@ control        :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
..@ delta        : num 0.1
..@ iter         : int 500
..@ thin         : int 500
..@ burnin       : int 4000
..@ initialize   : chr "random"
..@ alpha        : num 1.72
..@ seed         : int [1:5] 2003 5 63 100001 765
..@ verbose      : int 0
..@ prefix       : chr 
..@ save         : int 0
..@ nstart       : int 5
..@ best         : logi TRUE
..@ keep         : int 0
..@ estimate.beta: logi TRUE
..@ k              : int 29
..@ terms          : chr [1:1882] "–auto""| __truncated__ "–block""|   
..@ documents      : chr [1:5522] "1" "2" "3" "4" ...
..@ beta           : num [1:29, 1:1882] -10.7 -10.6 -10.6 -10.5 -10.6 ...
..@ gamma          : num [1:5522, 1:29] 0.0313 0.025 0.0236 0.0287 0.0287 
..@ wordassignments:List of 5
..$ i   : int [1:73447] 1 1 1 1 1 2 2 2 2 2 ...
..$ j   : int [1:73447] 175 325 409 689 1185 169 284 316 331 478 ...
..$ v   : num [1:73447] 12 29 3 27 4 12 12 12 3 3 ...
..$ nrow: int 5522
..$ ncol: int 1882
..- attr(*, "class")= chr "simple_triplet_matrix"
..@ loglikelihood  : num -408027
..@ iter           : int 500
..@ logLiks        : num(0) 
..@ n              : int 111776
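One detail worth noting in the output above: `alpha` is 1.72, which is the package's Gibbs default of 50/k for k = 29. A large alpha pushes document-topic distributions toward uniformity, which is one common cause of uniformly low gamma values. A hedged sketch of refitting with a smaller alpha (the value 0.1 is purely illustrative and should be tuned):

```r
# Same call as in the question, but with an explicit, smaller alpha.
# Smaller alpha -> peakier per-document topic distributions, which can
# raise the maximum document-topic probability per document.
ldaOut2 <- LDA(dtm, k, method = "Gibbs",
               control = list(nstart = nstart, seed = seed, best = best,
                              burnin = burnin, iter = iter, thin = thin,
                              alpha = 0.1))  # illustrative value, not a recommendation
```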

Can anyone guide me on how to improve the document-topic probabilities, or suggest anything else we can do to improve the model?

Disclaimer: I'm doing LDA for the first time, so I would really appreciate pointers to sources where I can find the required info.

This is a question about data analysis, not programming. You should ask such questions over at Cross Validated or some other more appropriate site. – MrFlick

1 Answer


Why do you need large probabilities? If you have a large dictionary, you may get very small probability values from LDA, and that's fine. As long as the rankings of words in each topic are distinct, you are very much on track to get good topic models. If you are interested in learning topic models from the basics, I encourage you to see these slides: http://www.cs.virginia.edu/~hw5x/Course/CS6501-Text-Mining/_site/docs/topic%20models.pptx
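One way to check that the word rankings really are distinct across topics is to compare each topic's top terms. This is a sketch, assuming the `ldaOut` model from the question (`topicmodels` package):

```r
# Top 10 terms per topic: a 10 x k character matrix, one column per topic
top_terms <- terms(ldaOut, 10)

# Pairwise overlap between the topics' top-term sets; small off-diagonal
# counts mean the topics are well separated, even if the individual
# document-topic probabilities are small.
k <- ncol(top_terms)
overlap <- sapply(seq_len(k), function(i)
  sapply(seq_len(k), function(j)
    length(intersect(top_terms[, i], top_terms[, j]))))
```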