Following this example, I've computed TF-IDF weightings for a set of documents. Now I want to use RowMatrix to compute document similarities, but I'm having trouble getting the data into the right format. What I have is a DataFrame whose rows have (String, SparseVector) as the two columns' types. I'm supposed to convert this to an RDD[Vector], which I thought would be as easy as:
features.map(row => row.getAs[SparseVector](1)).rdd
But I get this error:
<console>:58: error: Unable to find encoder for type stored in a
Dataset. Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._ Support for
serializing other types will be added in future releases.
Importing spark.implicits._ makes no difference.
So what's going on? I'm surprised that Spark doesn't know how to encode its own vector data types.
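In case it clarifies what I'm aiming for, here is a sketch of the conversion I had in mind. I'm assuming the encoder error goes away if I call .rdd before .map (since RDD.map doesn't need a Dataset Encoder), and that RowMatrix wants the older org.apache.spark.mllib.linalg.Vector, so something like Vectors.fromML would be needed to bridge from the ml vector stored in the DataFrame:

```scala
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Sketch: drop to the RDD first so no Dataset Encoder is involved,
// then convert each ml SparseVector to an mllib Vector for RowMatrix.
// `features` is assumed to be the (String, SparseVector) DataFrame above.
def toRowMatrix(features: DataFrame): RowMatrix = {
  val rows: RDD[Vector] = features.rdd.map { row =>
    Vectors.fromML(row.getAs[SparseVector](1))
  }
  new RowMatrix(rows)
}
```

Is that the right approach, or is there a way to make the Dataset-level map work directly?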