Convert a Spark dataframe to a vector

Question

I want to predict the output class for the rows of a Spark DataFrame using a naive Bayes classifier model. I am using Structured Streaming in Spark 2.1.0.

When I try to do that:

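from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.mllib.classification import NaiveBayesModel

# Split each message into tokens
tokenizer = Tokenizer(inputCol="message", outputCol="logTokenize")
tokenizeData = tokenizer.transform(stream_df)

# Hash the tokens into a 1000-dimensional term-frequency vector
hashingTF = HashingTF(inputCol="logTokenize", outputCol="rawFeatures", numFeatures=1000)
featurizedData = hashingTF.transform(tokenizeData)
stream_df = featurizedData.select("rawFeatures")

# Load the previously saved model
path = "/tmp/NaiveClassifier"
naive_classifier_model = NaiveBayesModel.load(spark.sparkContext, path)

# This call raises the TypeError shown below
predictions = naive_classifier_model.predict(stream_df)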

I got the following error message:

TypeError: Cannot convert type <class 'pyspark.sql.dataframe.DataFrame'> into Vector

stream_df is a Spark DataFrame, and I want to get a DataFrame containing the rawFeatures column and a column of predicted classes.

Suresh Suresh · Accepted Answer · 2017-07-24T17:39:14

Use pyspark.ml.feature.VectorAssembler to assemble the column into a vector:

from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=['rawFeatures'], outputCol="features")
stream_df = vecAssembler.transform(featurizedData)

Also, you are using the Tokenizer and HashingTF transformers, so I believe you can use an ML Pipeline to chain all the transformers together.

It is just a suggestion. Have a look.
