Mapping topic back to documents in Spark LDA

Question

I have loaded a number of Reuters newswire articles (1986) into Spark 2.2 and want to do some topic learning using LDA.

+--------------------+--------------------+----+
|               title|                body|  id|
+--------------------+--------------------+----+
|FED SAYS IT SETS ...|                    |5434|
|MIM TO ACQUIRE ST...|Mount Isa Mines H...|5435|
|MAGNA <MAGAF> CRE...|Magna Internation...|5436|
|J.W. MAYS INC <MA...|Shr 2.27 dlrs vs ...|5437|

I have set up a pipeline

val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, vectorizer, lda))
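
The question does not show how the four stages were defined. A minimal sketch of what they might look like, where the column names ("body", "tokens", "filtered", "features"), the tokenizer pattern, k and maxIter are all assumptions made for illustration. Only the "em" optimizer follows from the question, since the fitted stage is later cast to DistributedLDAModel:

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}

// Assumed: the text to model lives in the "body" column shown above.
val tokenizer = new RegexTokenizer()
  .setInputCol("body")
  .setOutputCol("tokens")
  .setPattern("\\W+") // split on runs of non-word characters

val stopWordsRemover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")

// Turns each token list into a sparse term-count vector.
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")

// k and maxIter are placeholder values, not taken from the question.
val lda = new LDA()
  .setOptimizer("em")
  .setK(10)
  .setMaxIter(20)
  .setFeaturesCol("features")
  .setTopicDistributionCol("topicDistribution")
```

Setting topicDistributionCol explicitly is optional ("topicDistribution" is already the default); it is spelled out here because that column is where the per-document topic mixture ends up after a transform.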

Then I fit the model:

val pipelineModel = pipeline.fit(corpus)

I can access the LDA (EM) model using

val ldaModel = pipelineModel.stages(3).asInstanceOf[DistributedLDAModel]

I can see the topics using

ldaModel.describeTopics(maxTermsPerTopic = 5).show()

After a bit of DataFrame manipulation, this gives the topics with their associated terms and probabilities:

+-------+----------+--------------------+
|topicId|      term|         probability|
+-------+----------+--------------------+
|      0|   company| 0.08715003585328869|
|      0|      corp| 0.03355461912220357|
|      0|     group|0.024083945559541863|
|      0|      unit|0.016712655949244752|
|      0|     stake| 0.01314416068270042|
|      1|      dlrs|   0.072961342546073|
|      1|      debt| 0.02826491264713813|
...
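
The "bit of DF manipulation" is not shown in the question. One way it might be done, as a sketch: describeTopics returns the columns topic, termIndices and termWeights, so the indices have to be looked up in the vectorizer's vocabulary and the arrays flattened. The helper name topicTerms and the output column names are hypothetical, chosen to match the table above:

```scala
import org.apache.spark.ml.clustering.LDAModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, udf}

// Hypothetical helper: one output row per (topic, term) pair.
def topicTerms(model: LDAModel, vocabulary: Array[String], maxTerms: Int = 5): DataFrame = {
  // Pair each vocabulary index with its weight and look up the term text.
  val zipTerms = udf { (indices: Seq[Int], weights: Seq[Double]) =>
    indices.zip(weights).map { case (i, w) => (vocabulary(i), w) }
  }
  model.describeTopics(maxTerms)
    .withColumn("termWeight", explode(zipTerms(col("termIndices"), col("termWeights"))))
    .select(
      col("topic").as("topicId"),
      col("termWeight._1").as("term"),
      col("termWeight._2").as("probability"))
}

// Usage with the names from the question, assuming the third stage is a CountVectorizer:
//   import org.apache.spark.ml.feature.CountVectorizerModel
//   val vocabulary = pipelineModel.stages(2).asInstanceOf[CountVectorizerModel].vocabulary
//   topicTerms(ldaModel, vocabulary).show()
```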

I want to map the topic distribution back to the original documents. Back in Spark 1.6, to get the topic distribution for the document above (id=5435), I would do the following, but topicDistributions is no longer supported:

 ldaModel.topicDistributions.filter(_._1 == 5435).collect

The Spark ML LDA API does list two new parameters, but I am unclear how to use them:

 final val topicConcentration: DoubleParam

 final val topicDistributionCol: Param[String]

Has anyone done this?

I have the same question. The answer below does not work. Did you find a solution? - Muz

user8357511 · Accepted Answer · 2017-07-24T11:39:18

I could be wrong, but it looks like you just want to transform:

ldaModel.transform(corpus)
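
A possible reason the comment above reports that this does not work, assuming corpus is the raw DataFrame from the question: the LDA stage on its own expects the feature-vector column produced by the vectorizer, which the raw corpus does not have. Transforming with the whole fitted pipeline instead should add the column named by topicDistributionCol ("topicDistribution" by default) to every document. A rough sketch, where both helper names are hypothetical and the id column is taken from the table in the question:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.clustering.LDAModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: one row per document with its topic distribution.
def documentTopics(pipelineModel: PipelineModel, corpus: DataFrame): DataFrame = {
  // Find the fitted LDA stage by type rather than by position in the pipeline.
  val ldaModel = pipelineModel.stages.collectFirst { case m: LDAModel => m }
    .getOrElse(sys.error("pipeline has no fitted LDA stage"))
  // Running every stage lets the LDA stage see the vectorizer's output column.
  pipelineModel.transform(corpus).select(col("id"), col(ldaModel.getTopicDistributionCol))
}

// Hypothetical helper: the topic distribution of one document, or None if the id is absent.
def topicDistributionFor(docTopics: DataFrame, docId: Long): Option[Vector] =
  docTopics.filter(col("id") === docId).head(1).headOption.map(_.getAs[Vector](1))

// Usage with the names from the question:
//   val docTopics = documentTopics(pipelineModel, corpus)
//   docTopics.show(false)
//   topicDistributionFor(docTopics, 5435L)
```

This plays the role of the Spark 1.6 topicDistributions lookup for id 5435. The other parameter mentioned in the question, topicConcentration, is a prior on the topic-term distributions and is not involved in mapping topics back to documents. As far as I recall, the Spark docs also warn that transform on a DistributedLDAModel collects the topics matrix to the driver, so this may be slow for a large vocabulary.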
