I'm having trouble understanding the LDA topic model results in Spark MLlib.
As I understand it, the result should look like the following:
Topic 1: term1, term2, term3, ...
Topic 2: term1, term2, term3, ...
...
Topic n: term1, term2, term3, ...

Doc1: Topic1, Topic2, ...
Doc2: Topic1, Topic2, ...
Doc3: Topic1, Topic2, ...
...
Docn: Topic1, Topic2, ...
I applied LDA to the Spark MLlib sample data, which looks like this:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
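I ran LDA following the LDA example in the Spark MLlib documentation; the code was essentially this (the path is the sample file shipped with Spark, and `sc` is the usual SparkContext):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the sample data: each row becomes a dense term-count vector
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// Index documents with unique IDs, since LDA expects (docId, vector) pairs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
println("topics: " + ldaModel.topicsMatrix)
```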
Afterwards I got the following result:
topics: org.apache.spark.mllib.linalg.Matrix =
10.33743440804936 9.104197117225599 6.5583684747250395
6.342536927434482 12.486281081997593 10.171181990567925
2.1728012328444692 2.1939589470020042 7.633239820153526
17.858082227094904 9.405347532724434 12.736570240180663
13.226180094790433 3.9570395921153536 7.816780313094214
6.155778858763581 10.224730593556806 5.619490547679611
7.834725138351118 15.52628918346391 7.63898567818497
4.419396221560405 3.072221927676895 2.5083818507627
1.4984991123084432 3.5227422247618927 2.978758662929664
5.696963722524612 7.254625667071781 11.048410610403607
11.080658179168758 10.11489350657456 11.804448314256682
Each column is the term distribution of one topic: there are 3 topics in total, and each topic is a distribution over 11 vocabulary terms. I think there are 12 documents, each of which contains 11 vocabulary terms. My questions are:
- How can I find the topic distribution of each document?
- Why does each topic have a distribution over 11 vocabulary terms when there are only 10 distinct values (0-9) in the data?
- Why doesn't each column sum to 100 (meaning 100%, as I understand it)?
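For the third question, here is how I checked the column sums (plain Scala on the printed values, outside Spark; the variable names are my own):

```scala
// topicsMatrix values copied from the output above: 11 terms (rows) x 3 topics (columns)
val topics: Array[Array[Double]] = Array(
  Array(10.33743440804936, 9.104197117225599, 6.5583684747250395),
  Array(6.342536927434482, 12.486281081997593, 10.171181990567925),
  Array(2.1728012328444692, 2.1939589470020042, 7.633239820153526),
  Array(17.858082227094904, 9.405347532724434, 12.736570240180663),
  Array(13.226180094790433, 3.9570395921153536, 7.816780313094214),
  Array(6.155778858763581, 10.224730593556806, 5.619490547679611),
  Array(7.834725138351118, 15.52628918346391, 7.63898567818497),
  Array(4.419396221560405, 3.072221927676895, 2.5083818507627),
  Array(1.4984991123084432, 3.5227422247618927, 2.978758662929664),
  Array(5.696963722524612, 7.254625667071781, 11.048410610403607),
  Array(11.080658179168758, 10.11489350657456, 11.804448314256682)
)
// Sum each column (topic) over all 11 terms
val colSums: Seq[Double] = (0 until 3).map(k => topics.map(_(k)).sum)
println(colSums.mkString(", ")) // each sum is roughly 86.5-86.9, not 100
```

The three sums all come out around 86-87, and their total is about 260, so they are clearly not percentages.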