My goal

I have a news corpus, and I want to use LDA to extract keywords for each news document. The keywords can also be called labels; they indicate what the news article is about.

Instead of using tf-idf, I searched the Internet and believe LDA can do this job better.

Let's define some terminology in advance:

  • "term" = "word": an element of the vocabulary
  • "token": instance of a term appearing in a document
  • "topic": multinomial distribution over terms representing some concept
  • "document": one piece of text, corresponding to one row in the input data

What I think about Spark LDA

Referring to the Spark docs (https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda), I found that EMLDAOptimizer produces a DistributedLDAModel, which stores not only the inferred topics but also the full training corpus and the topic distribution for each document in the training corpus.

A DistributedLDAModel supports:

  • topTopicsPerDocument: The top topics and their weights for each document in the training corpus
  • topDocumentsPerTopic: The top documents for each topic and the corresponding weight of the topic in the documents.

OnlineLDAOptimizer, by contrast, produces a LocalLDAModel, which only stores the inferred topics.

(from mllib)
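
For reference, those two methods belong to the mllib (RDD-based) DistributedLDAModel. A minimal sketch of that API, assuming a corpus of (docId, term-count vector) pairs, would look something like this:

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // corpus: RDD of (docId, term-count vector) pairs -- an assumption for this sketch
    def topicsForDocs(corpus: RDD[(Long, Vector)]): Unit = {
      val model = new LDA()
        .setK(15)
        .setOptimizer("em")
        .run(corpus)
        .asInstanceOf[DistributedLDAModel]

      // Top 3 topics and their weights for every document in the training corpus
      val docTopics: RDD[(Long, Array[Int], Array[Double])] = model.topTopicsPerDocument(3)

      // Top 10 documents per topic, with the topic's weight in each of those documents
      val topicDocs: Array[(Array[Long], Array[Double])] = model.topDocumentsPerTopic(10)
    }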

Then let's see what I have done

I have a news corpus of pure text, roughly 1.2 GB in size. After tokenizing, removing stopwords, and all the other data-cleaning steps (pre-processing), I use CountVectorizer with vocabSize set to 200000 and LDA with k set to 15, maxIter set to 100, the optimizer set to "em", and checkpointInterval set to 10; parameters not mentioned keep their default values. These two stages are put in a Pipeline for training.

(from ml)

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.LDA
    import org.apache.spark.ml.feature.CountVectorizer

    // Bag-of-words counts over the cleaned tokens, capped at 200,000 terms
    val countVectorizer = new CountVectorizer()
      .setInputCol("content_clean_cut")
      .setOutputCol("count_vector")
      .setVocabSize(200000)

    // EM LDA with 15 topics; checkpointing every 10 iterations
    val lda = new LDA()
      .setK(15)
      .setMaxIter(100)
      .setFeaturesCol("count_vector")
      .setOptimizer("em")
      .setCheckpointInterval(10)

    val pipeline = new Pipeline()
      .setStages(Array(countVectorizer, lda))

    val ldaModel = pipeline.fit(newsDF)
    ldaModel.write.overwrite().save("./news_lda.model")

I submitted the job with spark-submit and about 300 GB of memory, and it finally trained successfully.

Then I used this pipeline model to transform the pre-processed news corpus; the show() output is:

+------------------------------+----+--------------------+--------------------+
|             content_clean_cut| cls|        count_vector|   topicDistribution|
+------------------------------+----+--------------------+--------------------+
|  [深锐, 观察, 科比, 只想, ...|体育|(200000,[4,6,9,11...|[0.02062984049807...|
| [首届, 银联, 网络, 围棋赛,...|体育|(200000,[2,4,7,9,...|[0.02003532045153...|
|[董希源, 国米, 必除, 害群之...|体育|(200000,[2,4,9,11...|[0.00729266918401...|
| [李晓霞, 破茧, 成蝶, 只差,...|体育|(200000,[2,4,7,13...|[0.01200369382233...|
|  [深锐, 观察, 对手, 永远, ...|体育|(200000,[4,9,13,1...|[0.00613485655279...|

schema:

root
 |-- content_clean_cut: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cls: string (nullable = true)
 |-- count_vector: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)

I don't understand what this topicDistribution column means. Why is its length K? Does it mean that the index of the largest value is the topic index for this document (news article), so we can infer the document's topic by finding the index of the largest value? And is that index actually the index of the topic returned by the describeTopics() method?
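
If that reading is right, something like the following sketch is what I have in mind (here transformed stands for the DataFrame shown above, and dominantTopic / topic_index are just names I made up):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // topicDistribution has length K; argmax gives the index of the most probable topic
    val dominantTopic = udf { (v: Vector) => v.argmax }

    val withTopic = transformed.withColumn("topic_index", dominantTopic(col("topicDistribution")))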

I cast the second stage in the pipeline to DistributedLDAModel but failed to find anything related to topTopicsPerDocument and topDocumentsPerTopic. Why is this different from the official documentation?
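
To be concrete, this is the cast I mean (ldaModel is the fitted PipelineModel from the code above):

    import org.apache.spark.ml.clustering.DistributedLDAModel

    // stage 0 is the CountVectorizerModel, stage 1 is the LDA model
    val distLDAModel = ldaModel.stages(1).asInstanceOf[DistributedLDAModel]

    // I can see describeTopics() and topicsMatrix here,
    // but nothing like topTopicsPerDocument or topDocumentsPerTopic
    distLDAModel.describeTopics(10).show(false)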

There is also this topicsMatrix method on the DistributedLDAModel instance; what is it for? I have done some research and think topicsMatrix is every topic against every term in countVectorizerModel.vocabulary, but I don't think this topicsMatrix will help. Besides, some of the numbers in this matrix are doubles greater than 1, which confuses me. But this is not important.

What is more important: how do I use LDA to extract different keywords for each document (news article)?

Keyword extraction is tasked with the automatic identification of terms that best describe the subject of a document.

1 Answer

K is the number of topics to be clustered from your news corpus. The topicDistribution of each document is the array of K topic probabilities (basically telling you which topic index has the highest probability). You would then be required to manually label the K topics (based on the terms grouped under each topic), and hence you are able to "label" the documents.

LDA is not going to give you a "label" based on the text; instead, it clusters the related keywords into the desired k topics.
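
As a rough sketch of that flow, assuming the fitted pipeline from your question (CountVectorizerModel at stage 0, the LDA model at stage 1; a name like topKeywords is just a placeholder):

    import org.apache.spark.ml.clustering.LDAModel
    import org.apache.spark.ml.feature.CountVectorizerModel

    val cvModel = ldaModel.stages(0).asInstanceOf[CountVectorizerModel]
    val lda = ldaModel.stages(1).asInstanceOf[LDAModel]
    val vocab = cvModel.vocabulary

    // Top 10 terms per topic: these are the candidate "keywords" you label manually
    val topKeywords = lda.describeTopics(10)
      .collect()
      .map { row =>
        val topic = row.getAs[Int]("topic")
        val terms = row.getAs[Seq[Int]]("termIndices").map(vocab(_))
        topic -> terms
      }
      .toMap

    // A document's keywords are then the terms of its most probable topic
    // (the topic index taken from topicDistribution, as discussed in the question)

This way the keywords come from the topics themselves, not from the individual document, so two documents assigned to the same topic will share the same keyword set unless you refine them further.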