
I'm trying to reproduce the result of Graber et al. showing that when LDA is fit on a multilingual corpus, the most probable terms for a topic (say, the top 10) come from a single language. Their paper is here.

This is a reasonable sanity check to perform IMO, but I'm having difficulty.

I'm using the same corpus they used, the Europarl corpus, restricted to Bulgarian and English. I concatenated the Bulgarian and English corpora with

cat corpusBg.txt corpusEn.txt >> corpusMixed.txt

The result contains one sentence per line, with the first block of lines in Bulgarian and the second in English. When I fit an LDA model with 4 topics, 3 of them contain only English terms in their top 10, and the fourth mixes English and Bulgarian. I'm using gensim's default settings for LDA:

from gensim import corpora, models

# Read the mixed corpus: one sentence per line, lowercased and split on whitespace
texts = [doc.lower().split() for doc in open('corpusMixed.txt', 'r')]

# Build the dictionary and bag-of-words representation
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Fit LDA with 4 topics and otherwise default settings
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=4)
topics = lda.print_topics(lda.num_topics)

for t in topics:
    print(t)

Note that I have not removed stopwords or sparse terms, but intuitively that shouldn't matter: there should still be some topics whose top terms are only Bulgarian and others whose top terms are only English, no?
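For what it's worth, if the filtering does turn out to matter, gensim's Dictionary.filter_extremes can drop rare and very frequent terms before the bag-of-words corpus is built; the thresholds below are just placeholder values, not anything from the paper.

# Drop terms that appear in fewer than 5 sentences or in more than half of them
# (placeholder thresholds)
dictionary.filter_extremes(no_below=5, no_above=0.5)

# The bag-of-words corpus has to be rebuilt after filtering the dictionary
corpus = [dictionary.doc2bow(doc) for doc in texts]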


1 Answer


In the paper, they use a 10-topic model to demonstrate the phenomenon; you're using only 4.

When you run LDA with a small number of topics, distinct semantic topics get merged into 'chimera' topics (David Mimno's term, I believe). With only 4 topics for a corpus with "around 60 million words per language", that is almost inevitable. To be honest, I'm surprised that 10 topics is enough, though LDA would presumably find it hard to merge topics across languages, since very few word pairs from different languages co-occur in the same sentence.
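As a quick check, you could rerun the pipeline from the question with 10 topics to match the paper. The passes argument below is just my guess at giving the inference more time on a corpus this size, not something the paper specifies.

# Same gensim pipeline as in the question, but with 10 topics as in the paper.
# passes=5 is an arbitrary choice; gensim's default is a single pass over the corpus.
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary,
                               num_topics=10, passes=5)

for t in lda.print_topics(lda.num_topics):
    print(t)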