5
votes

I am running Latent Dirichlet Allocation (LDA) analyses for some research and keep running into a problem. Most LDA software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line represents the entirety of a document. However, Blei's lda-c and dynamic topic model software require the data to be in this format:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

Here [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that each [term_i] is an integer which indexes the term; it is not a string.

Does anyone know of a utility that will let me quickly convert to this format? Thank you.
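To make the target format concrete, here is a minimal Python sketch that builds such lines from raw strings. The helper name `to_ldac_lines`, the lowercasing, and the whitespace tokenizer are illustrative assumptions and not part of lda-c; term indices are simply assigned in order of first appearance.

```python
from collections import Counter


def to_ldac_lines(documents):
    """Return (lines, vocab); each line is '[M] [term]:[count] ...'.

    Hypothetical helper: lowercases, splits on whitespace, and assigns
    0-based integer term ids in order of first appearance.
    """
    vocab = {}  # token -> integer term id
    lines = []
    for text in documents:
        counts = Counter()
        for token in text.lower().split():
            term_id = vocab.setdefault(token, len(vocab))
            counts[term_id] += 1
        # First field is the number of unique terms, then id:count pairs.
        fields = [str(len(counts))]
        fields += [f"{term_id}:{count}" for term_id, count in sorted(counts.items())]
        lines.append(" ".join(fields))
    return lines, vocab


docs = ["I am the very model of a modern major general",
        "I have a major headache"]
lines, vocab = to_ldac_lines(docs)
for line in lines:
    print(line)
```

With this tokenizer the second document becomes `5 0:1 6:1 8:1 10:1 11:1`: five unique terms, each occurring once. Real corpora usually need better tokenization and stop-word handling than this sketch provides.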

4
I have run into a similar problem. Did you happen to find a solution? Thanks. - user288609
I have not implemented it yet, but this Python utility was posted to the topic models mailing list and is supposed to take text files and convert them to the correct format. - user836015
Thanks a lot, it is very helpful. - user288609

4 Answers

3
votes

If you are working with R, the lda package contains a function lexicalize that will convert raw text into the lda-c format necessary for the lda package.

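library(lda)

# Two toy documents, one string per document
example <- c("I am the very model of a modern major general",
             "I have a major headache")

# Returns a list with $documents (term indices and counts) and $vocab
corpus <- lexicalize(example, lower=TRUE)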

Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.

So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.

2
votes

The Mallet package from the University of Massachusetts Amherst is another option.

And here is an excellent step-by-step demo on how to use Mallet:

Mallet can take ordinary plain-text files as its input source.

1
votes

Gensim offers an implementation of Blei's corpus format. See here. You could build a quick corpus from your CSV file in Python and then save it in lda-c format with gensim. It should not be too hard.
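As a rough sketch of that approach, the function below uses gensim's `Dictionary` and `BleiCorpus.serialize`. The helper name `save_as_ldac` is hypothetical, and it assumes the documents have already been read from the CSV and split into tokens.

```python
def save_as_ldac(tokenized_docs, path):
    """Write tokenized documents to `path` in Blei's lda-c format.

    Hypothetical helper; `tokenized_docs` is a list of token lists.
    """
    # Imported lazily so this module still loads if gensim is not installed.
    from gensim.corpora import BleiCorpus, Dictionary

    dictionary = Dictionary(tokenized_docs)  # maps each token to an integer id
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    # Writes the corpus to `path`; gensim also writes a companion vocabulary file.
    BleiCorpus.serialize(path, bow_corpus, id2word=dictionary)
    return dictionary
```

For example, `save_as_ldac([["i", "have", "a", "major", "headache"]], "corpus.ldac")` would write one line for the single document. The returned dictionary is what maps the integer ids in the file back to tokens.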

0
votes

For Python, there is a function for this (it may not have been available when the question was asked):

lda.utils.dtm2ldac

The documentation is at https://pythonhosted.org/lda/api.html#module-lda.utils
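A minimal usage sketch, assuming the `lda` package is installed and that `dtm2ldac` yields one lda-c line per row of an integer document-term matrix, as the linked API page describes. The wrapper name `write_ldac` is hypothetical.

```python
def write_ldac(dtm, path):
    """Write an integer document-term matrix to `path` in lda-c format.

    Hypothetical wrapper; `dtm` has one row per document and one column
    per term, with integer counts as entries.
    """
    # Imported lazily: requires the third-party `lda` package.
    import lda.utils

    lines = list(lda.utils.dtm2ldac(dtm))  # one lda-c line per document
    with open(path, "w") as fh:
        for line in lines:
            fh.write(line + "\n")
    return lines
```

Called with a NumPy array such as `np.array([[2, 1, 0], [0, 1, 1]])`, this writes one line per row. The column positions serve as the term indices, so the vocabulary has to be kept separately.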