I am trying to run LDA, but not on words and documents: my data consists of error messages and error causes. Each row is an error and each column is an error cause; a cell is 1 if the cause was active and 0 if it was not. I now want to get, for each learned topic (here equivalent to an error pattern), the error-cause names rather than just the term indices. The code I have so far, which seems to work, is the following:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import LDA

# VectorAssembler combines all cause columns into one vector column;
# a deterministic column order makes it easier to map term indices back to names later
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != 'error_ID'],
    outputCol="features")
lda_input = assembler.transform(df)

# Train the LDA model
lda = LDA(k=5, maxIter=10, featuresCol="features")
model = lda.fit(lda_input)
# A model with higher log-likelihood and lower perplexity is considered to be good.
ll = model.logLikelihood(lda_input)
lp = model.logPerplexity(lda_input)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
# Describe topics.
topics = model.describeTopics(7)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
# Show the per-row topic distributions.
# show() prints the DataFrame itself and returns None, so don't wrap it in print()
transformed = model.transform(lda_input)
transformed.show(truncate=False)
My outputs are:
Based on https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda (which documents the older RDD-based spark.mllib API, while my code uses pyspark.ml) I added the part below. It failed at first because the loop ran over 10 topics although k=5, and because a pyspark.ml DenseMatrix is indexed with a tuple rather than chained brackets; with those fixed, it still only prints weights per term index, not names:
topics = model.topicsMatrix()  # DenseMatrix of shape vocabSize x k
for topic in range(5):  # k = 5 topics
    print("Topic " + str(topic) + ":")
    for word in range(model.vocabSize()):
        # DenseMatrix is indexed as matrix[row, column]
        print(" " + str(topics[word, topic]))
How do I now get the top error causes for each topic, i.e. find the column names corresponding to the term indices?