
Following this example, I've computed TF-IDF weightings for some documents. Now I want to use RowMatrix to compute document similarities, but I'm having trouble getting the data into the right format. What I have right now is a DataFrame whose rows have (String, SparseVector) as the two columns' types. I need to convert this to an RDD[Vector], which I thought would be as easy as:

features.map(row => row.getAs[SparseVector](1)).rdd()

But I get this error:

<console>:58: error: Unable to find encoder for type stored in a
Dataset.  Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._  Support for 
serializing other types will be added in future releases.

Importing spark.implicits._ makes no difference.

So what's going on? I'm surprised that Spark doesn't know how to encode its own vector types.


1 Answer


Just convert to an RDD before calling map. Dataset.map requires an Encoder for the result type (which Spark doesn't provide for its vector classes), but RDD.map has no such requirement.

import org.apache.spark.ml.linalg._
import spark.implicits._  // needed for toDF on a local Seq

val df = Seq((1, Vectors.sparse(1, Array(), Array()))).toDF

// df.rdd yields an RDD[Row]; RDD.map needs no Encoder,
// so extracting the vector column works here.
df.rdd.map(row => row.getAs[Vector](1))
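To close the loop on the original goal, here is a minimal sketch (assuming Spark 2.x, and reusing the `df` from above): RowMatrix still lives in the old `mllib` package and expects `mllib.linalg` vectors, so the `ml.linalg` vectors extracted from the DataFrame need an explicit conversion via `Vectors.fromML`.

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Convert the new-style ml vectors to the old mllib vectors
// that RowMatrix requires.
val rows = df.rdd.map { row =>
  OldVectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector](1))
}

val mat = new RowMatrix(rows)

// columnSimilarities() returns an upper-triangular CoordinateMatrix
// of cosine similarities between *columns*.
val sims = mat.columnSimilarities()
```

Note that `columnSimilarities` compares columns, not rows; since RowMatrix stores one document per row, getting document-to-document similarities this way requires the matrix to be transposed (or a different approach entirely).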