First, just in case, I will explain how I represent the documents that I want to run the LDA model on. I do some preprocessing to extract the most important terms per person across all of their documents, then I take the union of all those words.
// most important terms per person, across all of their documents
val text = groupedByPerson.map(s => (s._1, preprocessing.run(s, numWords, stopWords)))
// union of every person's important words
val unionText = text.flatMap(s => s._2.map(l => l._2)).toSet
I 'tokenize' all the words in all the documents with a regular expression,
val df: DataFrame = ...
val regexpr = """[a-zA-Z]+""".r
// the distinct alphabetic tokens in each document
val shaveText = df.select("text").map(row => regexpr.findAllIn(row.getString(0)).toSet)
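For example, on a single string this yields (the sentence is made up):

val tokens = regexpr.findAllIn("The quick brown fox, the quick dog!").toSet
// tokens: Set("The", "quick", "brown", "fox", "the", "dog")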
I also noticed that I need to convert the string 'words' into unique doubles before running the LDA model, similar to the example given in the documentation, so I created a map to convert all the words.

val unionTextZip = unionText.zipWithIndex.toMap
val numbersText = shaveText.map(set => set.map(s => unionTextZip(s).toDouble))
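To illustrate the mapping on a toy vocabulary (names and values here are purely illustrative):

val toyVocab = Set("apple", "banana", "cherry").zipWithIndex.toMap
// e.g. Map("apple" -> 0, "banana" -> 1, "cherry" -> 2)
val toyDoc = Set("banana", "cherry").map(s => toyVocab(s).toDouble)
// e.g. Set(1.0, 2.0)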
Then I create the corpus
// pair each document with an ID and wrap its word indices in a dense vector
val corpus = numbersText.zipWithIndex.map(s => (s._2, Vectors.dense(s._1.toArray))).cache
Now I run the LDA model
val ldaModel = new LDA().setK(3).run(corpus)
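For reference, these snippets assume the standard Spark imports:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.DataFrame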
When I check the vocabulary size, I notice that it is set to the size of the first document in the corpus, even though other documents have larger or smaller vocabularies.
Therefore, computing the topics matrix throws an error that looks something like this
Exception in thread "main" java.lang.IndexOutOfBoundsException: (200,0) not in [-31,31) x [-3,3)
at breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:544)
at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:541)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:541)
at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:533)
at application.main.Main$.main(Main.scala:110)
at application.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
I thought I could just use the vector to represent a bag of words. Do the vectors need to be of equal size? That is, should I create a boolean feature for each word in the vocabulary, indicating whether or not it appears in the document?
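In other words, would I need something like this instead (a rough sketch of what I have in mind, untested)?

// hypothetical fix: one slot per vocabulary word, so every vector has the same length
// (assumes every token appears in unionTextZip, as the code above already does)
val vocabSize = unionTextZip.size
val fixedText = shaveText.map { words =>
  val features = new Array[Double](vocabSize)
  words.foreach(w => features(unionTextZip(w)) = 1.0) // 1.0 if the word occurs, 0.0 otherwise
  Vectors.dense(features)
}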