I wrote the following code to convert a Spark SQL DataFrame df to an RDD[LabeledPoint]:

val targetInd = df.columns.indexOf("myTarget")
val ignored = List("myTarget")
val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))

df.printSchema

val dfLP = df.rdd.map(r => LabeledPoint(
  r.getDouble(targetInd),
  Vectors.dense(featInd.map(r.getDouble(_)).toArray)
))

The schema looks like this:

root
 |-- myTarget: long (nullable = true)
 |-- var1: long (nullable = true)
 |-- var2: double (nullable = true)

When I run dfLP.foreach(l => l.label), the following error occurs:

java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double

How can I cast the label to double? I assumed the other features could be either double or long. If that is not the case, I will also need to cast the remaining features to double.

You could do either r.getLong(targetInd).toDouble inside the map or df.withColumn("myTarget", df("myTarget").cast("double")) before it. Note that it should be done for each Long column. - Daniel de Paula
@DanieldePaula: Thanks. I tried val dfLP = df.rdd.map(r => LabeledPoint( r.getDouble(targetInd).toDouble, Vectors.dense(featInd.map(r.getDouble(_).toDouble).toArray) )), but I still get the same error when I run dfLP.foreach(l => l.label). - duckertito
You should do r.getLong() on long columns. Alternatively, if all columns are numeric, you can cast all of them to double before mapping (you could use foldLeft on df.columns, and call withColumn inside, for instance). - Daniel de Paula
@DanieldePaula: To test your second solution, could you please tell me how to cast the rest of features (their indices are stored in featInd)? df=df.withColumn("myTarget", df("myTarget").cast("double")) - duckertito
@DanieldePaula: Yes, all my columns are numeric. I would prefer to cast them all to double, as you suggest, by using foldLeft on df.columns. Could you please post a detailed answer so that I can check it and accept it? - duckertito

1 Answer

You could try casting all columns to double before mapping. Using foldLeft should do the trick:

// withColumn returns a new DataFrame, so keep the result of the fold
val dfDouble = df.columns.foldLeft(df) {
  (newDF, colName) => newDF.withColumn(colName, df(colName).cast("double"))
}
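
For completeness, here is a minimal sketch of how the cast could be combined with the LabeledPoint mapping from the question. It assumes every column is numeric and contains no nulls, and that Spark 2.x or later is available (SparkSession is only used to build sample data). The helper names castAllToDouble and toLabeledPoints, and the sample rows, are made up for illustration and are not part of any Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: cast every column to double.
// Assumes all columns are numeric and column names contain no dots.
def castAllToDouble(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (acc, name) =>
    acc.withColumn(name, col(name).cast("double"))
  }

// Hypothetical helper: use `target` as the label and every other column as a feature.
// Assumes no null values, since Row.getDouble fails on null.
def toLabeledPoints(df: DataFrame, target: String): RDD[LabeledPoint] = {
  val doubles = castAllToDouble(df)
  val targetInd = doubles.columns.indexOf(target)
  val featInd = doubles.columns.indices.filter(_ != targetInd).toArray
  doubles.rdd.map { r =>
    // After the cast, every column really holds a Double, so getDouble is safe.
    LabeledPoint(r.getDouble(targetInd), Vectors.dense(featInd.map(i => r.getDouble(i))))
  }
}

// Sample data with the same schema as in the question (values are invented).
val spark = SparkSession.builder().master("local[1]").appName("lp-sketch").getOrCreate()
import spark.implicits._
val sample = Seq((1L, 2L, 0.5), (0L, 3L, 1.5)).toDF("myTarget", "var1", "var2")
val points = toLabeledPoints(sample, "myTarget").collect()
```

Casting up front keeps the map simple: r.getDouble does not convert a Long for you, it only reads a value that is already a Double, which is why the original code failed on the long columns.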