I'm trying to classify IT support tickets into relevant topics using LDA in R.
My corpus has 5,550 documents and 1,882 terms. I started with 12,000 terms, but after removing common stop words and other noise words I'm left with roughly 1,800 words.
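For reference, the preprocessing was along these lines (a hypothetical toy version using the tm package; the actual ticket corpus and custom noise-word lists are not shown):

```r
library(tm)

# Toy stand-in for the ticket corpus (the real one has 5,550 documents)
docs <- c("The printer is not working again",
          "Cannot login to the VPN from home")
corpus <- VCorpus(VectorSource(docs))

# Lower-case, strip punctuation, and remove stop words before building the DTM
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)
dim(dtm)   # documents x remaining terms
```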
On examining the LDAvis output, the topics returned by the algorithm look pretty good, which I've verified against a sample of the corpus. The top words in the output are exclusive to their topics, and one can identify each topic on a first reading.
But on checking the document-topic probability matrix, the probabilities assigned are very low in the majority of cases (ideally they should be high, since the topics we're getting are good).
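Concretely, by "low" I mean that the per-document maximum of the gamma matrix is small. A self-contained sketch of that check, using the AssociatedPress data shipped with topicmodels in place of my DTM:

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

# Small Gibbs fit on a subset, just to illustrate the check
fit <- LDA(AssociatedPress[1:100, ], k = 5, method = "Gibbs",
           control = list(seed = 1, burnin = 200, iter = 500))

gamma <- posterior(fit)$topics    # document-topic probability matrix
summary(apply(gamma, 1, max))     # how peaked is each document's distribution?
```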
I've already tried the following: different numbers of topics and more iterations, but nothing has helped so far.
If I increase the number of terms in the corpus (i.e., remove fewer of the words), then I end up with a poor representation of topics.
My code and the LDA parameters are:

library(topicmodels)

## Gibbs sampling control parameters
burnin <- 4000
iter   <- 2000
thin   <- 500
seed   <- list(2003, 5, 63, 100001, 765)
nstart <- 5
best   <- TRUE
keep   <- 0     # was referenced in the call but never defined
k      <- 29    # established via the log-likelihood function

ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin,
                             keep = keep))
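For context, k = 29 came from comparing log-likelihoods across candidate values of k, roughly like this (a hypothetical, scaled-down version; the real search used my ticket DTM and a wider grid of k values):

```r
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
dtm_small <- AssociatedPress[1:50, ]

# Fit one model per candidate k and compare log-likelihoods
ks <- c(2, 5, 10)
ll <- sapply(ks, function(k)
  as.numeric(logLik(LDA(dtm_small, k, method = "Gibbs",
                        control = list(seed = 1, burnin = 100, iter = 200)))))
ks[which.max(ll)]   # candidate with the highest log-likelihood
```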
The str() of the LDA output is:
..@ seedwords : NULL
..@ z : int [1:111776] 12 29 3 27 11 12 14 12 12 24 ...
..@ alpha : num 1.72
..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
..@ delta : num 0.1
..@ iter : int 500
..@ thin : int 500
..@ burnin : int 4000
..@ initialize : chr "random"
..@ alpha : num 1.72
..@ seed : int [1:5] 2003 5 63 100001 765
..@ verbose : int 0
..@ prefix : chr
..@ save : int 0
..@ nstart : int 5
..@ best : logi TRUE
..@ keep : int 0
..@ estimate.beta: logi TRUE
..@ k : int 29
..@ terms : chr [1:1882] "–auto" "–block" ...
..@ documents : chr [1:5522] "1" "2" "3" "4" ...
..@ beta : num [1:29, 1:1882] -10.7 -10.6 -10.6 -10.5 -10.6 ...
..@ gamma : num [1:5522, 1:29] 0.0313 0.025 0.0236 0.0287 0.0287
..@ wordassignments:List of 5
..$ i : int [1:73447] 1 1 1 1 1 2 2 2 2 2 ...
..$ j : int [1:73447] 175 325 409 689 1185 169 284 316 331 478 ...
..$ v : num [1:73447] 12 29 3 27 4 12 12 12 3 3 ...
..$ nrow: int 5522
..$ ncol: int 1882
..- attr(*, "class")= chr "simple_triplet_matrix"
..@ loglikelihood : num -408027
..@ iter : int 500
..@ logLiks : num(0)
..@ n : int 111776
Can anyone guide me on how to improve the document-topic probabilities, or suggest something we can do to improve the model?
Disclaimer: I'm doing LDA for the first time, so I would really appreciate pointers to sources where I can find the required information.