2
votes

Relatively new to scala and the Spark API kit but I have a question trying to make use of the vector assembler

http://spark.apache.org/docs/latest/ml-features.html#vectorassembler

to then make use of matrix correlations

https://spark.apache.org/docs/2.1.0/mllib-statistics.html#correlations

The dataframe column is of dtype linalg.Vector

val assembler = new VectorAssembler()

val trainwlabels3 = assembler.transform(trainwlabels2)

trainwlabels3.dtypes(0)

res90: (String, String) = (features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7)

and yet calling this to an RDD for the statistics tool throws a mismatch error.

val data: RDD[Vector] = sc.parallelize(
  trainwlabels3("features")
) 

<console>:80: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Seq[org.apache.spark.mllib.linalg.Vector]

Thanks in advance for any help.

2

2 Answers

1
votes

You should just select:

val features = trainwlabels3.select($"features")

Convert to RDD

 val featuresRDD = features.rdd

and map:

featuresRDD.map(_.getAs[Vector]("features"))
0
votes

This should work for you:

val rddForStatistics = new VectorAssembler()
   .transform(trainwlabels2)
   .select($"features")
   .as[Vector] //turns Dataset[Row] (a.k.a DataFrame) to DataSet[Vector]
   .rdd

However, you should avoid RDDs and figure out how to do what you want with the DataFrame-based API (in the spark.ml package) because working with RDDs is all but deprecated in MLlib.