I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database.

This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label (LabeledPoint(double label, Vector features)) and my database contains some null values, how can I build the training set?

Here is the piece of code related to this issue:

val labeled = table.map{ row => 
    var s = row.toSeq.toArray           
    s = s.map(el => if (el != null) el.toString.toDouble)
    LabeledPoint(row(0), Vectors.dense((s.take(0) ++ s.drop(1))))
    }

And the error that I get:

error   : type mismatch;
found   : Any
required: Double

Can I run this algorithm without using LabeledPoint? If not, how can I overcome this "null value" issue?

1 Answer

Some reasons why this code cannot work:

  • Row.toSeq returns Seq[Any], so s (after toArray) is an Array[Any]
  • since you cover only the non-null case, el => if (el != null) el.toString.toDouble is of type T => AnyVal (where T is Any). If el is null it returns Unit
  • even if it weren't, you assign the result back to a var of type Array[Any], so Array[Any] is exactly what you get. Either way, it is not a valid input for Vectors.dense
  • Row.apply is of type Int => Any, so its output cannot be used as a label
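
The typing problems above can be reproduced without Spark. The following is only a minimal sketch in plain Scala 2 (the Scala line Spark builds against): NullSafeParsing, toDoubleOpt and toDoubles are hypothetical names that do not appear in the question, and substituting a default for missing cells is just one possible policy.

```scala
object NullSafeParsing {
  // Hypothetical helper: convert one untyped cell to Option[Double].
  // Option(...) maps null to None; Try absorbs non-numeric strings.
  def toDoubleOpt(el: Any): Option[Double] =
    Option(el).flatMap(v => scala.util.Try(v.toString.toDouble).toOption)

  // Convert a whole row, substituting `default` for missing cells.
  def toDoubles(cells: Seq[Any], default: Double): Array[Double] =
    cells.map(c => toDoubleOpt(c).getOrElse(default)).toArray

  def main(args: Array[String]): Unit = {
    val cells: Seq[Any] = Seq(1, null, "2.5")

    // No else branch: under Scala 2 the element type is inferred as AnyVal,
    // so the result cannot be passed where Array[Double] is expected.
    val widened = cells.map(el => if (el != null) el.toString.toDouble)

    // Every path yields a Double, so the result is a proper Array[Double].
    val doubles: Array[Double] = toDoubles(cells, default = 0.0)

    println(widened)
    println(doubles.mkString(", "))
  }
}
```

An Array[Double] built this way is the kind of input Vectors.dense accepts. Whether a default value is an acceptable stand-in for a missing feature depends on the data.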

Works, but has no effect:

  • s.take(0), which always returns an empty array

May stop working in Spark 2.0:

  • map over a DataFrame. There is not much we can do about it for now, since the Vector class has no encoder available.

How you can approach this:

  • either drop incomplete rows or fill in the missing values, for example using DataFrameNaFunctions:

      // You definitely want something smarter than that
      val fixed = df.na.fill(0.0)
      // or
      val filtered = df.na.drop
    
  • use VectorAssembler to build vectors:

    import org.apache.spark.ml.feature.VectorAssembler
    
    val assembler = new VectorAssembler()
      .setInputCols(df.columns.tail)
      .setOutputCol("features")
    
    val assembled = assembler.transform(fixed)
    
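  • convert to LabeledPoint:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.Row

    // Assuming the label column is called "label" and holds Doubles,
    // that sqlContext.implicits._ is in scope for the $"..." syntax,
    // and Spark 1.x, where VectorAssembler emits mllib vectors
    assembled.select($"label", $"features").rdd.map {
      case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
    }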