the accuracy of LDA predict for new documents with Spark

Question

I'm work with Mllib of Spark, and now is doing something with LDA.

But when I use the code provided by Spark(see bellow) to predict a Doc used in training the model, the result(document-topics) of predict is at opposite poles with the result of trained document-topics.

I don't know what caused the result.

Asking for help, and here is my code below:

train:$lda.run(corpus) the corpus is an RDD like this: $RDD[(Long, Vector)] the Vector contains vocabulary, index of words, wordcounts. predict:

    def predict(documents: RDD[(Long, Vector)], ldaModel: LDAModel):        Array[(Long, Vector)] = {
    var docTopicsWeight = new Array[(Long, Vector)](documents.collect().length)
    ldaModel match {
      case localModel: LocalLDAModel =>
        docTopicsWeight = localModel.topicDistributions(documents).collect()
      case distModel: DistributedLDAModel =>
        docTopicsWeight = distModel.toLocal.topicDistributions(documents).collect()
    }
    docTopicsWeight
  }

I assume you're seeing this problem when ldaModel is a DistributedLDAModel. Is that correct? — Jason Lenderman

eliasah eliasah · Accepted Answer · 2015-11-04T09:43:52

I'm not sure if your question actually concerns on why you were getting errors on your code but from I have understand, it seems first that you were using the default Vector class. Secondly, you can't use case class on the model directly. You'll need to use the isInstanceOf and asInstanceOf method for that.

def predict(documents: RDD[(Long, org.apache.spark.mllib.linalg.Vector)], ldaModel: LDAModel): Array[(Long, org.apache.spark.mllib.linalg.Vector)] = {

    var docTopicsWeight = new Array[(Long, org.apache.spark.mllib.linalg.Vector)](documents.collect().length)
    if (ldaModel.isInstanceOf[LocalLDAModel]) {
      docTopicsWeight = ldaModel.asInstanceOf[LocalLDAModel].topicDistributions(documents).collect
    } else if (ldaModel.isInstanceOf[DistributedLDAModel]) {
      docTopicsWeight = ldaModel.asInstanceOf[DistributedLDAModel].toLocal.topicDistributions(documents).collect
    }
    docTopicsWeight

}

the accuracy of LDA predict for new documents with Spark

1 Answers