I have successfully trained an LDA model in Spark via the Python API:
from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
This works fine, but I now need the document-topic matrix for the LDA model, and as far as I can tell all I can get is the word-topic matrix, using model.topicsMatrix().
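To make the distinction concrete, here is a toy sketch in plain Python. The numbers are made up for illustration (they are not Spark output), and `shape` is just a small helper I wrote for this example. As far as I understand, the topics matrix has one row per vocabulary term and one column per topic, whereas what I'm after has one row per document:

```python
# Toy illustration only: all values below are made up, not output from Spark.

# Word-topic matrix (what I can get today): vocab_size x k.
# Each column is one topic's weights over the vocabulary terms.
topics_matrix = [
    [0.5, 0.1, 0.0],
    [0.3, 0.2, 0.1],
    [0.1, 0.3, 0.4],
    [0.1, 0.4, 0.5],
]

# Document-topic matrix (what I want): num_docs x k.
# Each row is one document's mixture over the k topics and sums to 1.
doc_topic_matrix = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]


def shape(matrix):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


print("word-topic shape:", shape(topics_matrix))
print("document-topic shape:", shape(doc_topic_matrix))
```

Both matrices share the topic dimension (k = 3 here), but the first is indexed by term and the second by document, so I don't see a way to derive one from the other.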
Is there some way to get the document-topic matrix from the LDA model, and if not, is there an alternative method (other than implementing LDA from scratch) in Spark to run an LDA model that will give me the result I need?
EDIT:
After digging around a bit, I found the documentation for DistributedLDAModel in the Java API, which has a topicDistributions() method that I think is just what I need here (but I'm not 100% sure whether the LDAModel in PySpark is in fact a DistributedLDAModel under the hood).
In any case, I am able to indirectly call this method like so, without any overt failures:
In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480
But if I actually look at the results, all I get are strings telling me that each result is actually a Scala tuple (I think):
In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'}]
Maybe this is generally the right approach, but is there a way to get the actual results?