Convert one-document-per-line to Blei's lda-c/dtm format for topic modeling?

Question

I am doing Latent Dirichlet Allocation (LDA) for some research and keep running into a problem. Most LDA software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line represents the entirety of a document. However, Blei's lda-c and dynamic topic model software requires that data be in the format:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string.

Does anyone know of a utility that will let me quickly convert to this format? Thank you.

I have run into a similar problem. Did you happen to find a solution? Thanks. — user288609
I have not implemented it yet, but this Python utility was posted to the topic models mailing list and is supposed to take text files and convert them to the correct format. — user836015

Ben Ben · Accepted Answer · 2012-12-07T01:39:46

If you are working with R, the lda package contains a function lexicalize that will convert raw text into the lda-c format necessary for the lda package.

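library(lda)  # provides lexicalize()

example <- c("I am the very model of a modern major general",
             "I have a major headache")

# Tokenize and lowercase the documents, then index each word against a vocabulary
corpus <- lexicalize(example, lower=TRUE)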

Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.
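As a rough sketch of that second route, assuming the tm and topicmodels packages are installed, the same two example sentences can be turned into a document-term matrix and then passed to dtm2ldaformat. The wordLengths setting is my own choice here so that short words such as "I" and "a" are kept; tm drops words under three characters by default.

```r
library(tm)
library(topicmodels)

example <- c("I am the very model of a modern major general",
             "I have a major headache")

# Build a corpus with one document per element of the character vector
corpus <- VCorpus(VectorSource(example))

# Keep words of any length (assumption: short words matter for this example)
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Convert to the list structure used by the lda package:
# $documents is a list of 2-row integer matrices, $vocab is the term vector
ldaformat <- dtm2ldaformat(dtm)

str(ldaformat$documents[[2]])
ldaformat$vocab
```

In each matrix in ldaformat$documents, the first row holds the 0-based term index and the second row holds the count for that term.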

So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.
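If the goal is the plain-text file that lda-c itself reads, the 2-row document matrices used by the lda package (row 1 is the 0-based term index, row 2 is the count) still have to be written out one line per document. Below is a minimal base R sketch of that step. The helper name to_ldac_lines, the hand-built docs list, and the output file name are all hypothetical and only there for illustration; in practice the list would come from one of the functions above.

```r
# Hypothetical helper: turn lda-style document matrices into lda-c text lines.
# Each document is a 2-row integer matrix: row 1 = 0-based term index, row 2 = count.
to_ldac_lines <- function(documents) {
  vapply(documents, function(doc) {
    if (ncol(doc) == 0) return("0")  # empty document: zero unique terms
    # Sum counts per term index, in case a term occupies several columns
    counts <- tapply(doc[2, ], doc[1, ], sum)
    paste(length(counts),
          paste(names(counts), counts, sep = ":", collapse = " "))
  }, character(1))
}

# Two toy documents built by hand (assumption: stand-ins for real output)
docs <- list(
  matrix(as.integer(c(0, 2, 1, 1, 3, 1)), nrow = 2),  # terms 0, 1, 3
  matrix(as.integer(c(1, 1, 1, 1, 4, 1)), nrow = 2)   # term 1 twice, term 4 once
)

ldac_lines <- to_ldac_lines(docs)
print(ldac_lines)
# [1] "3 0:2 1:1 3:1" "2 1:2 4:1"

# Write one document per line to a file in the temporary directory
out_file <- file.path(tempdir(), "corpus.ldac")
writeLines(ldac_lines, out_file)
```

Summing with tapply means the helper works whether each term appears once with its total count or once per occurrence with a count of 1.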
