I tried to replicate your problem but in my case (using a very small corpus), I could not find any difference between the three sums.
I will still share the paths I tried in the case anybody else wants to replicate the problem ;-)
I used a small example from gensim's website and trained three LDA models, one per alpha setting:
from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_sym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                                   update_every=1, chunksize=100000, passes=1, alpha='symmetric')
lda_asym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                                    update_every=1, chunksize=100000, passes=1, alpha='asymmetric')
lda_auto = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                                    update_every=1, chunksize=100000, passes=1, alpha='auto')
Now I sum the topic probabilities over all documents (9 documents in total):
import pandas as pd

counts = {}
for model in [lda_sym, lda_asym, lda_auto]:
    s = 0
    for doc_n in range(len(corpus)):
        # sum of the topic probabilities for this document under this model
        doc_sum = pd.DataFrame(model[corpus[doc_n]])[1].sum()
        if doc_sum < 1:
            print('Sum smaller than 1 for')
            print(model, doc_n)
        s += doc_sum
    counts[model] = s
And indeed the sums are always 9:
counts = {<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3908>: 9.0,
<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3048>: 9.0,
<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3b70>: 9.0}
Of course this is not a representative example, since the corpus is so small. So if you can, please provide some more details about your corpus.
In general I would assume that this should always be the case. My first intuition was that empty documents might change the sum, but that is not the case either: an empty document simply yields a topic distribution identical to alpha (which makes sense):
pd.DataFrame(lda_asym[[]])[1]
returns
0 0.203498
1 0.154607
2 0.124657
3 0.104428
4 0.089848
5 0.078840
6 0.070235
7 0.063324
8 0.057651
9 0.052911
which is identical to
lda_asym.alpha
array([ 0.20349777, 0.1546068 , 0.12465746, 0.10442834, 0.08984802,
0.0788403 , 0.07023542, 0.06332404, 0.057651 , 0.05291085])
which also sums to 1.
From a theoretical point of view, choosing different alphas will yield different LDA models.
Alpha is the hyperparameter of the Dirichlet prior, the distribution from which we draw theta, and theta parameterizes each document's topic distribution. So essentially, alpha influences how topic distributions are drawn. That is why choosing different alphas will also give you slightly different results for
lda.show_topics()
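To make the role of alpha concrete, here is a small standalone sketch (no gensim needed, using the standard Gamma-normalization trick to draw from a Dirichlet). The alpha values below are arbitrary illustration choices, not gensim defaults. Whatever alpha you pick, each draw of theta is a probability vector summing to 1:

```python
import random

random.seed(0)

def draw_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    gammas = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

# A small alpha concentrates mass on a few topics; a large alpha spreads it
# out nearly uniformly. Either way the result is a valid topic distribution.
sparse_theta = draw_dirichlet([0.1] * 10)
uniform_theta = draw_dirichlet([10.0] * 10)
print(sum(sparse_theta), sum(uniform_theta))
```

So alpha changes the *shape* of the topic distributions the model favors, but it cannot change the fact that each distribution is normalized.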
But I do not see why the sum of topic probabilities should differ from 1 for any LDA model or any kind of document.