I'm trying to use a saved Mllib model to predict sentiment on live streaming data.

I've tried every recommendation I could find, but I still get errors. The current error is: Field "features" does not exist.

The schema of the training data is:

root
 |-- label: double (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)

lines = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrapServers) \
    .option("subscribe", topics) \
    .load() \
    .selectExpr("CAST(value AS STRING)")
# type(lines) is <class 'pyspark.sql.dataframe.DataFrame'>

read_data=lines.selectExpr("CAST(value AS STRING) as text")

model_nb = NaiveBayesModel.load("./myNBmodel")

prediction = model_nb.transform(read_data)

print(prediction.schema)

query1 = prediction.writeStream \
            .outputMode("update") \
            .foreach(process_row) \
            .start()

query1.awaitTermination()

The call that fails is

prediction = model_nb.transform(read_data)

with this error:

Py4JJavaError: An error occurred while calling o133.transform.
: java.lang.IllegalArgumentException: Field "features" does not exist. Available fields: text

Fetched data doesn't need features in order to get a prediction, right?

1 Answer

Fetched data doesn't need features in order to get a prediction, right?

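That's not correct.

Raw data has to be featurized before the model can score it. The NaiveBayesModel was trained on a "features" vector column, so it expects the same column at prediction time. That's why you should use Spark MLlib's ML Pipelines, so that Spark does the featurization rather than you: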

ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow.