16
votes

I have successfully trained an LDA model in Spark via the Python API:

from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
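
(Here corpus is an RDD of (document id, term-count vector) pairs, which is what LDA.train expects; a toy sketch, with made-up counts:

from pyspark.mllib.linalg import Vectors

# hypothetical corpus: each element is (docId, term-count vector over the vocabulary)
corpus = sc.parallelize([
    (0, Vectors.dense([1.0, 2.0, 0.0])),
    (1, Vectors.dense([0.0, 1.0, 3.0])),
])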

This works completely fine, but I now need the document-topic matrix for the LDA model; as far as I can tell, all I can get is the word-topic matrix, using model.topicsMatrix().

Is there some way to get the document-topic matrix from the LDA model, and if not, is there an alternative method (other than implementing LDA from scratch) in Spark to run an LDA model that will give me the result I need?

EDIT:

After digging around a bit, I found the documentation for DistributedLDAModel in the Java API, which has a topicDistributions() method that I think is just what I need here (though I'm not 100% sure that the LDAModel in PySpark is in fact a DistributedLDAModel under the hood...).

In any case, I am able to indirectly call this method like so, without any overt failures:

In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480

But if I actually look at the results, all I get are strings telling me that the result is actually a Scala tuple (I think):

In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'}]

Maybe this is generally the right approach, but is there a way to get the actual results?

3
I know that the LDA functionality in Spark is still in development, but it seems bizarre that there's no straightforward way of getting this info out of the model... – moustachio
I think there is another issue here. As Jason Lenderman pointed out to me, Spark LDA doesn't implement LSA but a variant of PLSI, which makes these matrices less directly useful. See also stackoverflow.com/a/32953813/1560062 – zero323
I see, but in that case a more or less equivalent solution would be to predict topics for the original training documents, similar to the method described in the linked question. But as far as I can tell the necessary methods aren't implemented in the Python API. Are they hidden somewhere, or is there some other way of achieving this in PySpark? – moustachio
As far as I can tell it is not accessible from Python. – zero323
Is there an answer to this question with PySpark 2.0.0? – Hardik Gupta

3 Answers

6
votes

After extensive research, this is definitely not possible via the Python API in the current version of Spark (1.5.1). But in Scala it's fairly straightforward, given an RDD documents on which to train the model:

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}

// first generate RDD of documents...

val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)

// then convert to a DistributedLDAModel
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]

Then getting the document-topic distributions is as simple as:

distLDAModel.topicDistributions

This returns an RDD[(Long, Vector)] of (document id, topic distribution) pairs.
5
votes

The following extends the above response for PySpark and Spark 2.0.

I hope you'll excuse me for posting this as a reply instead of as a comment, but I lack the rep at the moment.

I am assuming that you have a trained LDA model made from a corpus like so:

from pyspark.ml.clustering import LDA

lda = LDA(k=NUM_TOPICS, optimizer="em")
ldaModel = lda.fit(corpus)  # where corpus is a DataFrame with a 'features' column
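
If you don't already have such a corpus, a common way to build the 'features' column is CountVectorizer over tokenized text; a hypothetical sketch (the toy documents are made up):

from pyspark.ml.feature import CountVectorizer

# hypothetical tokenized input: (id, list of tokens)
tokenized = spark.createDataFrame(
    [(0, ["spark", "lda", "topics"]), (1, ["lda", "topic", "model"])],
    ["id", "tokens"])
cv = CountVectorizer(inputCol="tokens", outputCol="features")
corpus = cv.fit(tokenized).transform(tokenized)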

To convert a document into a topic distribution, we create a DataFrame of the document ID and a vector (sparse is better) of its word counts, then call transform().

from pyspark.ml.linalg import Vectors

documents = spark.createDataFrame([
    [1, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count})],
    [2, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count, another_index: 1.0})],
], schema=["id", "features"])
transformed = ldaModel.transform(documents)
dist = transformed.take(1)
# dist[0]['topicDistribution'] is now a dense vector of our topics.
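
To pull the distributions for every document at once (instead of take(1)), you can select the column directly; a small follow-up sketch, assuming the transformed DataFrame above:

for row in transformed.select("id", "topicDistribution").collect():
    print(row["id"], row["topicDistribution"])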
3
votes

As of Spark 2.0 you can use transform() as a method on pyspark.ml.clustering.DistributedLDAModel. I just tried this on the 20 newsgroups dataset from scikit-learn and it works. See the returned topicDistribution vector, which is a distribution over topics for a document.

>>> test_results = ldaModel.transform(wordVecs)
>>> test_results.first()
Row(filename='/home/jovyan/work/data/20news_home/20news-bydate-test/rec.autos/103343', target=7, text='I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.', tokens=['little', 'confused', 'models', 'bonnevilles', 'someone', 'differences', 'features', 'performance', 'curious', 'prefereably', 'usually', 'demand', 'spring', 'summer'], vectors=SparseVector(10977, {28: 1.0, 29: 1.0, 152: 1.0, 301: 1.0, 496: 1.0, 552: 1.0, 571: 1.0, 839: 1.0, 1114: 1.0, 1281: 1.0, 1288: 1.0, 1624: 1.0}), topicDistribution=DenseVector([0.0462, 0.0538, 0.045, 0.0473, 0.0545, 0.0487, 0.0529, 0.0535, 0.0467, 0.0549, 0.051, 0.0466, 0.045, 0.0487, 0.0482, 0.0509, 0.054, 0.0472, 0.0547, 0.0501]))
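
If you only want the document-topic vectors rather than the whole rows, a simple follow-up (assuming the test_results DataFrame above):

>>> test_results.select("topicDistribution").show(3, truncate=False)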