I set up my data to feed into the Apache Spark LDA model. The one hang-up I'm having is converting each list in my RDD to a DenseVector, because some of the values are alphanumeric. The error I get when running the example code is a failure to convert a string to a float.
I understand why this error occurs, given what I know about dense vectors and floats, but there has to be a way to load these string values into an LDA model, since this is a topic model.
I should have prefaced this by saying that I'm new to Python and Spark, so I apologize if I'm misinterpreting something. My code is below. Thank you in advance!
Example
https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
Code:
>>> rdd5.take(3)
[[u'11394071', u'11052103', u'11052101'], [u'11847272', u'11847272',
u'11847272', u'11847272', u'11847272', u'11847272', u'11847272',
u'11847272', u'11847272', u'11847272', u'999999', u'11847272',
u'11847272', u'11847272', u'11847272', u'11847272', u'11847272',
u'11847272', u'11847272', u'11847272', u'11847272'], [u'af1lowprm1704',
u'af1lowprm1704', u'af1lowprm1704', u'af1lowprm1704', u'af1lowprm1704',
u'am1prm17', u'am1prm17', u'af1highprm1704', u'af1highprm1704']]
>>> parsedData = rdd5.map(lambda line: Vectors.dense([float(x) for x in line]))
>>> parsedData.take(3)
ValueError: could not convert string to float: af1lowprm1704
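My current guess is that instead of casting the tokens to floats, I need to map each distinct token to a vocabulary index and build term-count vectors, since that's the kind of input a topic model works on. A minimal sketch of that idea in plain Python (toy documents, not my real data; in Spark the same logic would run inside map()):

```python
# Sketch: turn documents of string tokens into term-count vectors,
# which is what LDA expects, instead of casting tokens to float.
docs = [
    [u'11394071', u'11052103', u'11052101'],
    [u'af1lowprm1704', u'af1lowprm1704', u'am1prm17'],
]

# Build a vocabulary: one index per distinct token across all documents.
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in docs for t in doc}))}

def to_counts(doc, vocab):
    """Return a dense list of term counts, one slot per vocabulary entry."""
    counts = [0.0] * len(vocab)
    for tok in doc:
        counts[vocab[tok]] += 1.0
    return counts

vectors = [to_counts(doc, vocab) for doc in docs]
```

Each resulting list could then be wrapped in Vectors.dense() without any string-to-float problem, since the vector entries are counts rather than the tokens themselves.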
Next Steps in Code Once Fixed:
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)
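To make sure I have the expected corpus shape right, here is a plain-Python picture of what the zipWithIndex step above should produce (hypothetical toy count vectors, not my real data):

```python
# zipWithIndex().map(lambda x: [x[1], x[0]]) yields
# [document_id, term_count_vector] pairs, which is what LDA.train consumes.
vectors = [
    [1.0, 1.0, 1.0, 0.0, 0.0],  # term counts for document 0
    [0.0, 0.0, 0.0, 2.0, 1.0],  # term counts for document 1
]
corpus = [[doc_id, vec] for doc_id, vec in enumerate(vectors)]
# corpus == [[0, [1.0, 1.0, 1.0, 0.0, 0.0]], [1, [0.0, 0.0, 0.0, 2.0, 1.0]]]
```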