I am running Latent Dirichlet Allocation (LDA) in Spark and am trying to understand the output it gives.
Here is my sample dataset after the text-feature transform using Tokenizer, StopWordsRemover, and CountVectorizer:
[Row(Id=u'39', tf_features=SparseVector(1184, {89: 1.0, 98: 2.0, 108: 1.0, 168: 3.0, 210: 1.0, 231: 1.0, 255: 1.0, 290: 1.0, 339: 1.0, 430: 1.0, 552: 1.0, 817: 1.0, 832: 1.0, 836: 1.0, 937: 1.0, 999: 1.0, 1157: 1.0})),
Row(Id=u'7666', tf_features=SparseVector(1184, {15: 2.0, 186: 2.0, 387: 2.0, 429: 2.0, 498: 2.0}))]
As per Spark's sparse vector representation, tf_features stands for: (vocab_size, {term_id: term_freq, ...})
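To make sure I am reading this right, here is how I decode one row by hand, with plain Python (values copied from the first few entries of the Id=39 row above):

```python
vocab_size = 1184
# Sparse term-frequency entries for document Id=39: {term_id: count}
# (only the first few entries copied here)
tf = {89: 1.0, 98: 2.0, 108: 1.0, 168: 3.0, 210: 1.0}

# Term 168 appears 3 times in this document; any term id absent
# from the sparse map has an implicit count of 0.
print(tf.get(168, 0.0))  # 3.0
print(tf.get(500, 0.0))  # 0.0
```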
Now I ran the initial code below:
from pyspark.ml.clustering import LDA

lda = LDA(featuresCol="tf_features", k=10, seed=1, optimizer="online")
ldaModel = lda.fit(tf_df)
lda_df = ldaModel.transform(tf_df)
First I inspect the resulting transformed data frame.
lda_df.take(3)
Out[73]:
[Row(Id=u'39', tf_features=SparseVector(1184, {89: 1.0, 98: 2.0, 108: 1.0, 168: 3.0, 210: 1.0, 231: 1.0, 255: 1.0, 290: 1.0, 339: 1.0, 430: 1.0, 552: 1.0, 817: 1.0, 832: 1.0, 836: 1.0, 937: 1.0, 999: 1.0, 1157: 1.0}), topicDistribution=DenseVector([0.0049, 0.0045, 0.0041, 0.0048, 0.9612, 0.004, 0.004, 0.0041, 0.0041, 0.0042])),
Row(Id=u'7666', tf_features=SparseVector(1184, {15: 2.0, 186: 2.0, 387: 2.0, 429: 2.0, 498: 2.0}), topicDistribution=DenseVector([0.0094, 0.1973, 0.0079, 0.0092, 0.0082, 0.0077, 0.7365, 0.0078, 0.0079, 0.008])),
Row(Id=u'44', tf_features=SparseVector(1184, {2: 1.0, 9: 1.0, 122: 1.0, 444: 1.0, 520: 1.0, 748: 1.0}), topicDistribution=DenseVector([0.0149, 0.8831, 0.0124, 0.0146, 0.013, 0.0122, 0.0122, 0.0124, 0.0125, 0.0127]))]
My understanding is that the topicDistribution column represents the weight of each topic in that row's document, i.e. it is the distribution of topics over each document. Makes sense.
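As a sanity check on that reading: each topicDistribution row should be a probability distribution over the 10 topics, so its entries should sum to roughly 1 (values copied from the Id=39 row above, so the sum is only approximate due to rounding in the printed output):

```python
# topicDistribution for document Id=39, as printed above
dist = [0.0049, 0.0045, 0.0041, 0.0048, 0.9612,
        0.004, 0.004, 0.0041, 0.0041, 0.0042]

# Should sum to ~1 (up to rounding of the displayed values)
total = sum(dist)
print(round(total, 2))  # 1.0

# The largest entry marks the dominant topic for this document
print(dist.index(max(dist)))  # 4
```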
Now I inspect two methods of the LDA model.
ldaModel.describeTopics().show(2,truncate=False)
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices |termWeights |
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0 |[0, 39, 68, 43, 50, 59, 49, 84, 2, 116]|[0.06362107696025378, 0.012284342954240298, 0.012104887652365797, 0.01066583226047289, 0.01022196994114675, 0.008836060842769776, 0.007638318779273158, 0.006478523079841644, 0.006421040016045976, 0.0057849412030562125]|
|1 |[3, 1, 8, 6, 4, 11, 14, 7, 9, 2] |[0.03164821806301453, 0.031039573066565747, 0.018856890552836778, 0.017520190459705844, 0.017243870770548828, 0.01717645631844006, 0.017147930104624565, 0.01706912474813669, 0.016946362395557312, 0.016722361546119266] |
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows
This seems to show the distribution of terms in each topic, by term id. It shows ten terms per topic (configurable via the method's maxTermsPerTopic parameter). Again, makes sense.
Second method is below:
In [82]:
ldaModel.topicsMatrix()
Out[82]:
DenseMatrix(1184, 10, [132.7645, 3.0036, 13.3994, 3.6061, 9.3199, 2.4725, 9.3927, 3.4243, ..., 0.5774, 0.8335, 0.49, 0.6366, 0.546, 0.8509, 0.5081, 0.6627], 0)
Now, as per the docs, topicsMatrix is a matrix of topics and their terms, where topics are the columns and the terms in each topic are the rows; its size should be vocab_size x k (number of topics).
I don't seem to see that here, and I am not sure what this output means.
Secondly, how do I map these term ids back to the actual words? In the end I want a list of topics (as columns or rows, whatever) with the top 10-15 words/terms in each, so that I can interpret the topics from the kinds of words present. Right now I just have ids and no words.
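For what it's worth, here is the mapping logic I have in mind, sketched with plain Python lists. In my actual pipeline the vocabulary would come from the fitted CountVectorizerModel's vocabulary attribute (a list where position i holds the word for term id i); cv_model and the small vocab below are just stand-ins for illustration:

```python
# Stand-in vocabulary; in the real pipeline this would be
# cv_model.vocabulary, where cv_model is the fitted CountVectorizerModel.
vocab = ["spark", "data", "model", "topic", "term", "word"]

# One row of describeTopics(), as plain lists (made-up values)
term_indices = [3, 1, 5]
term_weights = [0.031, 0.029, 0.017]

# Map each term id back to its word, keeping the weight alongside it
top_terms = [(vocab[i], w) for i, w in zip(term_indices, term_weights)]
print(top_terms)  # [('topic', 0.031), ('data', 0.029), ('word', 0.017)]
```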
Any idea on these two?
Edit II:
When I just do topics[0][1] I get an error, as mentioned in the comment below.
So I convert it to numpy array like below:
topics.toArray()
It looks like this:
array([[ 132.76450545, 2.26966742, 0.73646762, 7.35362275,
0.57789645, 0.58248036, 0.65876465, 0.6695292 ,
0.70034004, 0.63875301],
[ 3.00362754, 68.80842798, 0.48662529, 100.31770907,
0.57867623, 0.5357196 , 0.58895636, 0.83408602,
0.53400242, 0.56291545],
[ 13.39943055, 37.070078
This is a 1184 x 10 array, so I am assuming it is the topics matrix with the distribution of words.
If that is the case, then the entries should be probabilities, but here we see numbers greater than 1, like 132.76. What are these values, then?
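In case it helps frame the question: if these columns were unnormalized weights (e.g. expected term counts) rather than probabilities, then normalizing each column to sum to 1 would recover a per-topic distribution over terms. A sketch of that idea on a tiny made-up matrix (2 terms x 3 topics; the real one would be 1184 x 10):

```python
# Stand-in for a vocab_size x k matrix of raw, unnormalized values
topics = [[132.76,  3.00, 13.40],
          [  2.27, 68.81, 37.07]]

n_terms = len(topics)
n_topics = len(topics[0])

# Normalize each column (topic) so its entries sum to 1
col_sums = [sum(topics[r][c] for r in range(n_terms)) for c in range(n_topics)]
normalized = [[topics[r][c] / col_sums[c] for c in range(n_topics)]
              for r in range(n_terms)]

# Every column is now a probability distribution over terms
for c in range(n_topics):
    print(sum(normalized[r][c] for r in range(n_terms)))  # 1.0 each
```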