I am a beginner in Spark, big data, and Scala. I am trying to build a predictive model in Spark with a sample data set. I wanted to use PySpark, but MLlib for PySpark currently has the limitation that it cannot save and load models. I have a couple of questions:
My data is in CSV format and looks like this:
Buy,Income,Is Female,Is Married,Has College,Is Professional,Is Retired,Unemployed,Residence Length,Dual Income,Minors,Own,House,White,English,Prev Child Mag,Prev Parent Mag
0,24000,1,0,1,1,0,0,26,0,0,0,1,0,0,0,0
1,75000,1,1,1,1,0,0,15,1,0,1,1,1,1,1,0
Basically this data helps predict whether a user buys this magazine or not based on all the given parameters.
How can I convert this data into a format that Spark can easily work with? (I have looked at other related answers here about converting a CSV into an RDD and have tried them, but they only made me more confused.)
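This is roughly what I have tried so far for the parsing part, assuming the file is saved at a path like "magazine.csv" (the path is just a placeholder I made up). I am not sure whether this is the right approach:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// "magazine.csv" is just a placeholder path for my sample file
val raw = sc.textFile("magazine.csv")
val header = raw.first()
// skip the header row, then use the first column (Buy) as the label
// and the remaining 16 columns as the feature vector
val points = raw.filter(line => line != header).map { line =>
  val cols = line.split(',').map(_.toDouble)
  LabeledPoint(cols(0), Vectors.dense(cols.drop(1)))
}.cache()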
If I just run the logistic regression program given in the MLlib documentation on this data, using part of the data for training and the rest for testing, how do I turn that into a demo-able format where I enter a new user, the program walks me through all the parameters, and at the end it gives me a "yes" or "no" on whether this new person will buy the magazine?
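For the demo part, this is roughly what I picture once I have a trained classifier. The willBuy name is just something I invented, and the feature values are made up for a hypothetical new user, so this is only a sketch of what I mean:

import org.apache.spark.mllib.classification.ClassificationModel
import org.apache.spark.mllib.linalg.Vectors

// given some trained MLlib classifier, answer yes/no for one hypothetical new user
def willBuy(model: ClassificationModel): String = {
  // 16 made-up feature values, in the same order as the CSV columns after Buy
  val newUser = Vectors.dense(Array(50000.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 10.0,
    1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0))
  // I assume a classification model's predict returns 0.0 or 1.0 here
  if (model.predict(newUser) == 1.0) "yes" else "no"
}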
This is the linear regression example from the MLlib documentation that I am using as a starting point:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")
Basically, where do I go from here if I use this program as my starting point?
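Concretely, this is my rough guess at how I would adapt it to my data, switching to LogisticRegressionWithLBFGS since my label (Buy) is 0/1, and reusing the points RDD from my parsing attempt above. I have no idea whether this is the right direction:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// split my parsed data: 70% for training, 30% for testing
val Array(training, test) = points.randomSplit(Array(0.7, 0.3), seed = 11L)

// Buy is 0 or 1, so train a two-class logistic regression model
val lrModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

// check accuracy on the held-out test set
val predictionAndLabels = test.map(p => (lrModel.predict(p.features), p.label))
val accuracy = predictionAndLabels.filter { case (pred, label) => pred == label }.count().toDouble / test.count()
println("Test accuracy = " + accuracy)

// and then, for the demo, something like: println(willBuy(lrModel))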