3
votes

I am having some trouble understanding the LDA topic model result in Spark MLlib.

To my understanding, we should get a result like the following:

 Topic 1: term1, term2, term....
 Topic 2: term1, term2, term3...
 ...
 Topic n: term1, ........

 Doc1 : Topic1, Topic2,...
 Doc2 : Topic1, Topic2,...
 Doc3 : Topic1, Topic2,...
 ...
 Docn :Topic1, Topic2,...

I applied LDA to the Spark MLlib sample data, which looks like this:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
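
For reference, this is essentially the MLlib example pipeline, shown here as a Java sketch (sc is assumed to be a JavaSparkContext, and sample_lda_data.txt contains the rows above):

    import scala.Tuple2;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.clustering.LDA;
    import org.apache.spark.mllib.clustering.LDAModel;
    import org.apache.spark.mllib.linalg.Matrix;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    // Each line of the file is one document, given as a vector of term counts
    JavaRDD<String> data = sc.textFile("data/mllib/sample_lda_data.txt");
    JavaRDD<Vector> parsedData = data.map(s -> {
        String[] parts = s.trim().split(" ");
        double[] counts = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            counts[i] = Double.parseDouble(parts[i]);
        }
        return Vectors.dense(counts);
    });
    // Pair each document with a unique ID: (docId, termCountVector)
    JavaPairRDD<Long, Vector> corpus =
            JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
    corpus.cache();

    // Cluster the documents into three topics using LDA
    LDAModel ldaModel = new LDA().setK(3).run(corpus);
    Matrix topics = ldaModel.topicsMatrix();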

Afterwards I get the following results:

topics: org.apache.spark.mllib.linalg.Matrix = 

10.33743440804936   9.104197117225599   6.5583684747250395  
6.342536927434482   12.486281081997593  10.171181990567925  
2.1728012328444692  2.1939589470020042  7.633239820153526   
17.858082227094904  9.405347532724434   12.736570240180663  
13.226180094790433  3.9570395921153536  7.816780313094214   
6.155778858763581   10.224730593556806  5.619490547679611   
7.834725138351118   15.52628918346391   7.63898567818497    
4.419396221560405   3.072221927676895   2.5083818507627     
1.4984991123084432  3.5227422247618927  2.978758662929664   
5.696963722524612   7.254625667071781   11.048410610403607  
11.080658179168758  10.11489350657456   11.804448314256682  

Each column is the term distribution of one topic. There are 3 topics in total, and each topic is a distribution over 11 vocabulary terms.
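
If I read the layout correctly, entry (word, topic) of this vocabSize x k matrix is the weight of a word in a topic, so it can be printed like this (adapted from the MLlib docs example):

    // topics is the Matrix shown above; ldaModel is the trained model
    for (int topic = 0; topic < 3; topic++) {
        System.out.print("Topic " + topic + ":");
        for (int word = 0; word < ldaModel.vocabSize(); word++) {
            System.out.print(" " + topics.apply(word, topic));
        }
        System.out.println();
    }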

I think there are 12 documents, each described by 11 vocabulary terms. My questions are:

  • How can I find the topic distribution of each document?
  • Why does each topic have a distribution over 11 vocabulary terms when there are only 10 different values (0-9) in the data?
  • Why doesn't the sum of each column equal 100 (meaning 100%, according to my understanding)?

2 Answers

3
votes

You can get the topic distribution for each document by calling DistributedLDAModel.topicDistributions() or, from Java, DistributedLDAModel.javaTopicDistributions() in Spark 1.4. This only works if the model's optimizer is set to EMLDAOptimizer (the default).

There is an example and further documentation in the Spark MLlib clustering guide.

It looks something like this in Java:

    LDAModel ldaModel = lda.setK(k.intValue()).run(corpus);
    // The cast is valid only for models trained with the (default) EM optimizer
    JavaPairRDD<Long, Vector> topicDistOverDocs =
            ((DistributedLDAModel) ldaModel).javaTopicDistributions();
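
You can then inspect the result, for example by collecting it (fine for a corpus this small); each vector holds one probability per topic and sums to roughly 1.0:

    // Print each document's topic mixture
    for (Tuple2<Long, Vector> doc : topicDistOverDocs.collect()) {
        System.out.println("Doc " + doc._1() + ": " + doc._2());
    }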

As for the second question:

The numbers in the input file are term counts, not term IDs. Each row is a count vector of length 11 (for example, the first row 1 2 6 0 2 3 1 1 0 0 3 means term 0 occurs once, term 1 twice, term 2 six times, and so on), so the vocabulary size is 11. LDA returns a probability distribution over each word in the vocabulary for each topic, which is why you have three topics (three columns), each with 11 rows (one per vocabulary word).

1
votes

Why doesn't the sum of each column equal 100 (meaning 100%, according to my understanding)?

  1. Use the describeTopics method to get each topic's distribution over words (vocabulary terms).

  2. The probabilities for each topic then sum to 1.0 (almost; floating-point rounding means it will not be exactly 1.0). The topicsMatrix output in your question is evidently not normalized, which is why its columns do not sum to 1 (or 100%).

Example code in Java:

    // describeTopics returns, for each topic, a pair of
    // (term indices, term probabilities), sorted by descending weight
    Tuple2<int[], double[]>[] topicDescs = ldaModel.describeTopics();

    for (int t = 0; t < topicDescs.length; t++) {
        int[] terms = topicDescs[t]._1();
        double[] probs = topicDescs[t]._2();

        System.out.print("Topic " + t + ":");
        double sum = 0.0;
        for (int w = 0; w < terms.length; w++) {
            // Print each term index with its probability within this topic
            System.out.format("\t%d:%f", terms[w], probs[w]);
            sum += probs[w];
        }
        // The per-topic probabilities should sum to ~1.0
        System.out.println(" (" + sum + ")");
    }
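
Run against the model above, each topic line should end with a sum very close to (but usually not exactly) 1.0, which is the floating-point effect mentioned in point 2.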