3
votes

Using Apache Spark's mllib, I have a Logistic Regression model that I store in HDFS. This Logistic Regression model is trained on historical data coming in from some sensors.

I have another spark program that consumes streaming data from these sensors. I want to be able to use the pre-existing trained model to do predictions on incoming data stream. Note: I don't want my model to be updated by this data.

To load the training model, I'd have to use the following line in my code:

val logisticModel = LogisticRegressionModel.load(sc, <location of model>)

where sc is the Spark context.

However, this application is a streaming application and hence already has a "StreamingContext" set up. Now, from what I've read, it is bad practice to have two contexts in the same program (even though it is possible).

Does this mean that my approach is wrong and I can't do what I'm trying to do?

Also, would it make more sense to keep storing the stream data in a file and repeatedly run logistic regression on that, rather than trying to do it directly in the streaming application?

1
Were you able to load and use the model in the streaming context? When I use the predict function in the streaming context, it throws a serialization exception. I am using Spark 1.4.0. – Srivatsan Nallazhagappan

1 Answer

2
votes

A StreamingContext can be created in a few ways, including two constructors which take an existing SparkContext:

  • StreamingContext(path: String, sparkContext: SparkContext) - where path is a path to a checkpoint directory
  • StreamingContext(sparkContext: SparkContext, batchDuration: Duration)

So you can simply create a SparkContext, load the required models, and then create the StreamingContext:

val sc: SparkContext = ???
...
val ssc = new StreamingContext(sc, Seconds(1))
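Putting it together, a minimal sketch of the whole flow might look like the following. The HDFS model path, the socket source, and the comma-separated input format are assumptions for illustration; only the load/predict calls come from the MLlib API itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// One SparkContext, shared by model loading and streaming.
val conf = new SparkConf().setAppName("StreamingPrediction")
val sc = new SparkContext(conf)

// Load the pre-trained model once, on the driver. It is read-only;
// nothing in the stream updates it. (Model path is an assumption.)
val logisticModel = LogisticRegressionModel.load(sc, "hdfs:///models/lr-model")

// Build the StreamingContext on top of the same SparkContext.
val ssc = new StreamingContext(sc, Seconds(1))

// Hypothetical source: lines of comma-separated sensor readings.
val predictions = ssc.socketTextStream("localhost", 9999)
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .map(features => logisticModel.predict(features))

predictions.print()
ssc.start()
ssc.awaitTermination()
```

Because the model is referenced inside the map closure, it gets serialized and shipped to the executors with each task; the model object itself must therefore be serializable (which LogisticRegressionModel is in recent 1.x releases).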

You can also get the SparkContext from an existing StreamingContext using its sparkContext method:

val ssc: StreamingContext = ???
ssc.sparkContext: SparkContext
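So if the StreamingContext already exists, you don't need a second context at all; a sketch (the model path is an assumption):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.mllib.classification.LogisticRegressionModel

val ssc: StreamingContext = ???

// Reuse the SparkContext embedded in the StreamingContext
// to load the saved model - no second context is created.
val logisticModel =
  LogisticRegressionModel.load(ssc.sparkContext, "hdfs:///models/lr-model")
```

Either way, there is only ever one SparkContext in the application, so the "two contexts" concern does not apply.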