First, just in case, I will explain how I represent the documents that I want to run the LDA model on. I do some preprocessing to extract the most important terms per person across all of their documents, then I take the union of all those words.
// most important terms per person, across all of their documents
val text = groupedByPerson.map(s => (s._1, preprocessing.run(s, numWords, stopWords)))
// union of every person's important words
val unionText = text.flatMap(s => s._2.map(l => l._2)).toSet
I 'tokenize' all the words in all the documents with a regular expression,
val df: DataFrame = ...
val regexpr = """[a-zA-Z]+""".r
// the distinct alphabetic tokens in each document
val shaveText = df.select("text").map(row => regexpr.findAllIn(row.getString(0)).toSet)
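For example, on a single string this yields (the sentence is made up):

val tokens = regexpr.findAllIn("The quick brown fox, the quick dog!").toSet
// tokens: Set("The", "quick", "brown", "fox", "the", "dog")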
I also noticed that I need to convert the string 'words' into unique doubles before running the LDA model, similar to the example given in the documentation, so I created a map to convert all the words.

val unionTextZip = unionText.zipWithIndex.toMap
val numbersText = shaveText.map(set => set.map(s => unionTextZip(s).toDouble))
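To illustrate the mapping on a toy vocabulary (names and values here are purely illustrative):

val toyVocab = Set("apple", "banana", "cherry").zipWithIndex.toMap
// e.g. Map("apple" -> 0, "banana" -> 1, "cherry" -> 2)
val toyDoc = Set("banana", "cherry").map(s => toyVocab(s).toDouble)
// e.g. Set(1.0, 2.0)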
Then I create the corpus
// pair each document with an ID and wrap its word indices in a dense vector
val corpus = numbersText.zipWithIndex.map(s => (s._2, Vectors.dense(s._1.toArray))).cache
Now I run the LDA model
val ldaModel = new LDA().setK(3).run(corpus)
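For reference, these snippets assume the standard Spark imports:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.DataFrame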
When I check the vocabulary size, I notice that it is set to the size of the first document in the corpus, even though other documents have larger or smaller vocabularies.
Therefore, computing the topics matrix throws an error that looks something like this
Exception in thread "main" java.lang.IndexOutOfBoundsException: (200,0) not in [-31,31) x [-3,3)
at breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:544)
at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:541)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:541)
at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:533)
at application.main.Main$.main(Main.scala:110)
at application.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
I thought I could just use the vector to represent a bag of words. Do the vectors need to be of equal size? That is, should I create a boolean feature for each word in the vocabulary, indicating whether or not it appears in the document?
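In other words, would I need something like this instead (a rough sketch of what I have in mind, untested)?

// hypothetical fix: one slot per vocabulary word, so every vector has the same length
// (assumes every token appears in unionTextZip, as the code above already does)
val vocabSize = unionTextZip.size
val fixedText = shaveText.map { words =>
  val features = new Array[Double](vocabSize)
  words.foreach(w => features(unionTextZip(w)) = 1.0) // 1.0 if the word occurs, 0.0 otherwise
  Vectors.dense(features)
}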