8
votes

I'm just starting out with ML and Apache Spark, so I've been trying out Linear Regression based on the Spark examples. I can't seem to generate a proper model for any data except the sample in the example, and the intercept is always 0.0, regardless of the input data.

I've prepared a simple training data-set based on the function:

y = (2*x1) + (3*x2) + 4

i.e. I would expect the intercept to be 4 and the weights to be (2, 3).
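
For anyone who wants to reproduce this, here is a minimal sketch of how such a file could be generated. The object name `MakeDummyData`, its helper names, and the 10-by-10 grid of feature values are assumptions for illustration; only the function and the comma-separated (y, x1, x2) layout come from the question.

```scala
object MakeDummyData {
  // Label for one point of y = (2*x1) + (3*x2) + 4
  def label(x1: Double, x2: Double): Double = 2 * x1 + 3 * x2 + 4

  // One CSV line in the (y, x1, x2) layout that the parsing code expects
  def line(x1: Double, x2: Double): String = s"${label(x1, x2)},$x1,$x2"

  // Assumption: a small n-by-n grid of integer-valued features starting at 1
  def lines(n: Int): Seq[String] =
    for (x1 <- 1 to n; x2 <- 1 to n) yield line(x1.toDouble, x2.toDouble)

  // Prints the lines to stdout; redirect to data/dummydata.txt to save them
  def main(args: Array[String]): Unit =
    lines(10).foreach(println)
}
```

The first two points it produces are (1, 1) with label 9 and (1, 2) with label 12, which matches the "Actual" values shown further down.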

If I run the LinearRegressionWithSGD.train(...) on the raw data, the model is:

Model intercept: 0.0, weights: [NaN,NaN]

And the predictions are all NaN:

Features: [1.0,1.0], Predicted: NaN, Actual: 9.0
Features: [1.0,2.0], Predicted: NaN, Actual: 12.0

etc

If I scale the data first, I get:

Model intercept: 0.0, weights: [17.407863391511754,2.463212481736855]

Features: [1.0,1.0], Predicted: 19.871075873248607, Actual: 9.0
Features: [1.0,2.0], Predicted: 22.334288354985464, Actual: 12.0
Features: [1.0,3.0], Predicted: 24.797500836722318, Actual: 15.0

etc

Either I'm doing something wrong, or I don't understand what the output from this model should be, so can anybody suggest where I might be going wrong here?

My code is below:

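  import org.apache.spark.mllib.feature.StandardScaler
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  // Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
  // i.e. intercept should be 4, weights (2, 3)?
  // sc is the SparkContext (predefined in spark-shell)
  val data = sc.textFile("data/dummydata.txt")

  // LabeledPoint is (label, [features])
  val parsedData = data.map { line =>
    val parts = line.split(',')
    val label = parts(0).toDouble
    val features = Array(parts(1), parts(2)) map (_.toDouble)
    LabeledPoint(label, Vectors.dense(features))
  }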

  // Scale the features
  val scaler = new StandardScaler(withMean = true, withStd = true)
                   .fit(parsedData.map(x => x.features))
  val scaledData = parsedData
                  .map(x => 
                  LabeledPoint(x.label, 
                     scaler.transform(Vectors.dense(x.features.toArray))))

  // Building the model: SGD = stochastic gradient descent
  val numIterations = 1000
  val step = 0.2
  val model = LinearRegressionWithSGD.train(scaledData, numIterations, step)

  println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")

  // Evaluate model on training examples
  val valuesAndPreds = scaledData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, point.features, prediction)
  }
  // Print out features, actual and predicted values...
  valuesAndPreds.take(10).foreach({case (v, f, p) => 
      println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")})
2
For PySpark, if anyone wonders, this would be model = LinearRegressionWithSGD.train(res, intercept = True) (comment by David Arenburg)

2 Answers

11
votes

@Noah: Thanks - your advice prompted me to look at this again, and I found some example code here that allows you to generate the intercept and also set other params, such as number of iterations, via the optimiser.

Here's my revised model generation code, which seems to work OK on my dummy data:

  // Building the model: SGD = stochastic gradient descent.
  // setIntercept(true) is needed, and it seems to work only with scaled data
  val numIterations = 600
  val stepSize = 0.1
  val algorithm = new LinearRegressionWithSGD()
  algorithm.setIntercept(true)
  algorithm.optimizer
    .setNumIterations(numIterations)
    .setStepSize(stepSize)

  val model = algorithm.run(scaledData)

It still seems to need scaled data, rather than the raw data, as input, but that's OK for my purposes here.
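
One thing worth keeping in mind when training on scaled data: the fitted intercept and weights are then expressed in the scaled feature space, so they will not read as 4 and (2, 3) directly. As a sketch, assuming each feature was standardized as x'_j = (x_j - mu_j) / sigma_j, and writing b' and w'_j for the model fitted on scaled data (this notation is mine, not Spark's), the raw-space parameters can be recovered like this:

```latex
\begin{aligned}
y &= b' + \sum_j w'_j \,\frac{x_j - \mu_j}{\sigma_j}
   = \Bigl(b' - \sum_j \frac{w'_j \mu_j}{\sigma_j}\Bigr)
     + \sum_j \frac{w'_j}{\sigma_j}\, x_j \\
\Rightarrow\quad
w_j &= \frac{w'_j}{\sigma_j}, \qquad
b = b' - \sum_j \frac{w'_j \mu_j}{\sigma_j}
\end{aligned}
```

For the dummy function, a perfect fit on scaled data would therefore have weights (2*sigma_1, 3*sigma_2) and intercept 4 + 2*mu_1 + 3*mu_2.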

9
votes

The train method you're using is a shortcut that sets the intercept to zero and doesn't try to find one. If you use the underlying class, you can get a nonzero intercept:

val model = new LinearRegressionWithSGD(step, numIterations, 1.0).
    setIntercept(true).
    run(scaledData)

That should give you a nonzero intercept.
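
To see in isolation what fitting or not fitting the intercept does, here is a minimal plain-Scala sketch. It is ordinary batch gradient descent written for illustration, not Spark's implementation, and the names `InterceptSketch` and `fit`, the 5-by-5 grid, and the choice to only center the features (rather than fully standardize them) are all assumptions. In these centered coordinates the function from the question becomes y = 2*x1 + 3*x2 + 19, so a correct fit should report an intercept close to 19.

```scala
object InterceptSketch {
  // Centered 5x5 grid: each feature takes the values -2..2 (raw 1..5 minus the mean 3)
  val xs: Array[Array[Double]] =
    (for (a <- 1 to 5; b <- 1 to 5) yield Array(a - 3.0, b - 3.0)).toArray

  // Labels from the raw function y = (2*x1) + (3*x2) + 4, with raw x = centered x + 3
  val ys: Array[Double] = xs.map(x => 2 * (x(0) + 3) + 3 * (x(1) + 3) + 4)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Batch gradient descent on half the mean squared error.
  // When fitIntercept is false, the intercept b is never updated and stays 0.0.
  def fit(fitIntercept: Boolean, iterations: Int, stepSize: Double): (Double, Array[Double]) = {
    val n = xs.length
    val w = Array.fill(xs.head.length)(0.0)
    var b = 0.0
    for (_ <- 0 until iterations) {
      // Residuals are computed once per iteration, before any parameter is updated
      val errors = xs.indices.map(i => b + dot(w, xs(i)) - ys(i))
      for (j <- w.indices)
        w(j) -= stepSize * xs.indices.map(i => errors(i) * xs(i)(j)).sum / n
      if (fitIntercept) b -= stepSize * errors.sum / n
    }
    (b, w)
  }

  def main(args: Array[String]): Unit =
    for (flag <- Seq(false, true)) {
      val (b, w) = fit(flag, 500, 0.1)
      println(s"fitIntercept=$flag intercept=$b weights=${w.mkString("[", ",", "]")}")
    }
}
```

With the flag off, the intercept stays at exactly 0.0, so every prediction is shifted by a constant even though the weights themselves can still converge on this particular grid. With the flag on, the intercept is learned alongside the weights.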