I'm trying to reproduce the results of Graber et al. in showing that when LDA is used with a multilingual corpus, the most probable terms for a topic (say, top 10) will come from a single language. Their paper is here.
This is a reasonable sanity check to perform IMO, but I'm having difficulty.
I'm using the same corpus they used, the Europarl corpus, restricted to Bulgarian and English. I concatenated the Bulgarian and English corpora with
cat corpusBg.txt corpusEn.txt >> corpusMixed.txt.
The resulting file contains one sentence per line, with the first block of lines in Bulgarian and the second block in English. When I fit an LDA model with 4 topics, 3 of the topics contain only English terms in their top 10, and the fourth mixes English and Bulgarian. I'm using gensim with the default settings for LDA:
from gensim import corpora, models

# One document per line; lowercase and split on whitespace
with open('corpusMixed.txt', 'r') as f:
    texts = [doc.lower().split() for doc in f]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4)

for t in lda.print_topics(lda.num_topics):
    print(t)
Note that I have not removed stopwords or sparse terms, but I would think this shouldn't matter. Intuitively, shouldn't there be some topics whose top terms are only in Bulgarian and others whose top terms are only in English?