I am running Latent Dirichlet Allocation (LDA) for some research and keep running into a problem. Most LDA software expects documents in doclines format, meaning a CSV or other delimited file in which each line holds the entire text of one document. However, Blei's lda-c and dynamic topic model software require one line per document in this format:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string.
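
To make the target format concrete, here is a rough Python sketch of the conversion as I understand it. It is only an illustration under my own assumptions: documents are tokenized on whitespace, term indices are zero-based and assigned in order of first appearance, and the function name doclines_to_ldac is made up. None of that comes from lda-c itself.

```python
from collections import Counter


def doclines_to_ldac(lines):
    """Convert one-document-per-line text to '[M] [term]:[count] ...' lines.

    Assumptions (not prescribed by lda-c): whitespace tokenization and
    zero-based term indices assigned in order of first appearance.
    Returns the converted lines and the term -> index vocabulary.
    """
    vocab = {}       # term string -> integer index
    converted = []
    for line in lines:
        counts = Counter()
        for token in line.split():
            # Assign the next free index the first time a term is seen.
            index = vocab.setdefault(token, len(vocab))
            counts[index] += 1
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        # len(counts) is M, the number of unique terms in this document.
        converted.append(f"{len(counts)} {pairs}".rstrip())
    return converted, vocab


docs = ["the cat sat on the mat", "the dog sat"]
ldac_lines, vocab = doclines_to_ldac(docs)
for row in ldac_lines:
    print(row)
# 5 0:2 1:1 2:1 3:1 4:1
# 3 0:1 2:1 5:1
```

The vocabulary is returned as well, since the integer indices are meaningless without the mapping back to the original terms.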
Does anyone know of a utility that will let me quickly convert to this format? Thank you.