
I set up my data to feed into the Apache Spark LDA model. The one hang-up I'm having is converting the list to a dense vector, because I have some alphanumeric values in my RDD. The error I receive when running the example code is about converting a string to a float.

I understand why this error occurs, given what I know about dense vectors and floats, but there has to be a way to load these string values into an LDA model, since this is a topic model.

I should have prefaced this by stating that I'm new to Python and Spark, so I apologize if I'm misinterpreting something. I'll add my code below. Thank you in advance!

Example

https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda

Code:

>>> rdd5.take(3)
[[u'11394071', u'11052103', u'11052101'], [u'11847272', u'11847272',
u'11847272', u'11847272', u'11847272', u'11847272', u'11847272',
u'11847272', u'11847272', u'11847272', u'999999', u'11847272',
u'11847272', u'11847272', u'11847272', u'11847272', u'11847272',
u'11847272', u'11847272', u'11847272', u'11847272'], [u'af1lowprm1704',
u'af1lowprm1704', u'af1lowprm1704', u'af1lowprm1704', u'af1lowprm1704',
u'am1prm17', u'am1prm17', u'af1highprm1704', u'af1highprm1704']]

>>> parsedData = rdd5.map(lambda line: Vectors.dense([float(x) for x in line]))
>>> parsedData.take(3)
ValueError: could not convert string to float: af1lowprm1704

Next Steps in Code Once Fixed:

# Index each document with a unique ID
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

1 Answer


You are indeed misinterpreting the example: the file sample_lda_data.txt does not contain text (check it), but word count vectors that have already been extracted from a corpus. This is indicated in the text preceding the example:

In the following example, we load word count vectors representing a corpus of documents.

So, you first need to build these word count vectors from your own corpus before proceeding as you are trying to.
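
As a rough, untested sketch of what that could look like for your data (assuming rdd5 is the RDD of token lists shown in your question; the vocabulary-building approach here is just one illustrative option, not the only way):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import LDA
from collections import Counter

# Build a vocabulary: one integer index per distinct token in the corpus
vocab = rdd5.flatMap(lambda doc: doc).distinct().zipWithIndex().collectAsMap()
vocab_size = len(vocab)

# Turn each document (a list of tokens) into a vector of word counts
def to_count_vector(doc):
    counts = Counter(vocab[token] for token in doc)
    return Vectors.dense([counts.get(i, 0) for i in range(vocab_size)])

parsedData = rdd5.map(to_count_vector)

# Index each document with a unique ID, as in the official example
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

For anything beyond toy data you would want Vectors.sparse (or CountVectorizer from pyspark.ml.feature with the DataFrame-based API) rather than dense vectors, but the point is the same: LDA consumes word counts, not the raw strings themselves.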