2 votes

I am a beginner in Spark, big data, and Scala. I am trying to build a predictive model in Spark with a sample dataset. I wanted to use PySpark, but MLlib for PySpark currently has a limitation: it doesn't support model save and load. I have a couple of questions:

  1. My data is in csv format and looks like this:

    Buy,Income,Is Female,Is Married,Has College,Is Professional,Is Retired,Unemployed,Residence Length,Dual Income,Minors,Own,House,White,English,Prev Child Mag,Prev Parent Mag
    0,24000,1,0,1,1,0,0,26,0,0,0,1,0,0,0,0
    1,75000,1,1,1,1,0,0,15,1,0,1,1,1,1,1,0
    

Basically, this data helps predict whether a user buys this magazine or not, based on all the given parameters.

How can I convert this data into a format easily interpreted by Spark? (I have looked at other related answers here about converting CSV into an RDD and tried them, but they left me more confused than before. The sketch below shows the rough idea I've pieced together so far.)
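From the other answers I've read, this is what I've come up with (just a sketch, not tested end to end; data/magazine.csv is a made-up path for my file, and the header row is skipped before mapping each line to a LabeledPoint):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    // hypothetical path to the csv shown above
    val raw = sc.textFile("data/magazine.csv")
    // drop the header line that holds the column names
    val header = raw.first()
    val points = raw.filter(_ != header).map { line =>
      val cols = line.split(',').map(_.toDouble)
      // first column (Buy) is the label, the remaining 16 columns are features
      LabeledPoint(cols(0), Vectors.dense(cols.tail))
    }.cache()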

  2. If I just run the logistic regression program given in the MLlib documentation on this data, where part of the data is used for training and the rest for testing, how do I turn it into a demo-able format where, given a new user, the program walks me through all the parameters and at the end gives me a "yes" or "no" on whether this new person will buy the magazine? (I sketch what I mean below, after the code.)

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    
    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()
    
    // Building the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)
    
    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
    println("training Mean Squared Error = " + MSE)
    
    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = LinearRegressionModel.load(sc, "myModelPath")
    

Basically where do I go from here if I use this program as my starting point?
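For the demo part, what I have in mind is something like the following sketch: prompt for each feature on the console, then feed the answers to the trained model. Here "model" stands for whatever classifier comes out of the training step (it is not defined in this snippet), and the feature names just mirror my CSV header:

    import scala.io.StdIn
    import org.apache.spark.mllib.linalg.Vectors

    // one prompt per feature column, in the same order as the training data
    val featureNames = Seq("Income", "Is Female", "Is Married", "Has College",
      "Is Professional", "Is Retired", "Unemployed", "Residence Length",
      "Dual Income", "Minors", "Own", "House", "White", "English",
      "Prev Child Mag", "Prev Parent Mag")
    val answers = featureNames.map { name =>
      print(s"$name: ")
      StdIn.readLine().toDouble
    }
    // model is assumed to be an already-trained binary classifier
    val verdict = if (model.predict(Vectors.dense(answers.toArray)) == 1.0) "yes" else "no"
    println(s"Will this person buy the magazine? $verdict")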

@maasg can you help answer this question? I have read a lot of your replies to Spark-related questions and they have helped a lot. Thanks! – serendipity

It is not easy, but you have to increase the dimensionality of your data by converting all categorical variables into new columns. – Alberto Bonsanto

1 Answer

0 votes

Having trained that model, you can predict whether a new input is classified as 1 or as 0. To do so:

    // put your feature values here, in the same order as the training columns
    val yourInputData: Array[Double] = Array(/* your feature values */)
    val res = model.predict(Vectors.dense(yourInputData))
    println(res)

The vector you pass to the predict method must have the same number of dimensions as the data that was used to construct the model, i.e. the data in "data/mllib/ridge-data/lpsa.data".
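Since your label is a binary buy/no-buy, a logistic regression classifier is a more natural fit than the linear regression example you pasted. A minimal sketch of the whole flow, assuming your CSV sits at the hypothetical path data/magazine.csv with the header row still in place:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    // parse the csv: skip the header, first column is the 0/1 label
    val raw = sc.textFile("data/magazine.csv")
    val header = raw.first()
    val points = raw.filter(_ != header).map { line =>
      val cols = line.split(',').map(_.toDouble)
      LabeledPoint(cols(0), Vectors.dense(cols.tail))
    }

    // 80/20 train/test split
    val Array(training, test) = points.randomSplit(Array(0.8, 0.2), seed = 11L)

    // train a binary logistic regression model
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

    // fraction of held-out rows the model classifies correctly
    val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
    println(s"test accuracy = $accuracy")

predict then returns 0.0 or 1.0 for any new 16-element feature vector, which maps directly onto the "no"/"yes" answer you want for your demo.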