
Following the answer to this question, How to convert type Row into Vector to feed to the KMeans,

I have created the features table for my data (assembler is a VectorAssembler):

val kmeanInput  = assembler.transform(table1).select("features")

When I run k-means with kmeanInput

val clusters = KMeans.train(kmeanInput, numCluster, numIteration)

I get the error

:102: error: type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] val clusters = KMeans.train(kmeanInput, numCluster, numIteration)

As @Jed mentioned in his answer, this happens because the rows are not in Vectors.dense format. To solve this I tried

 val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in row["features"]]))

And I get this error

:3: error: ')' expected but '(' found. val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in row["features"]]))

:3: error: ';' expected but ')' found. val dat = kmeanInput.rdd.map(lambda row: Vectors.dense([x for x in row["features"]]))
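The compile errors arise because `lambda row: ...` and the list comprehension are Python syntax, not Scala. For reference, a sketch of the same conversion in Scala might look like this (assuming the `features` column produced by the VectorAssembler holds an `org.apache.spark.ml.linalg.Vector`, which must be converted to an mllib `Vector` for `KMeans.train`):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Map each Row to an mllib Vector by extracting the ml Vector from the
// "features" column and rebuilding it from its array of doubles.
val dat = kmeanInput.rdd.map { row =>
  Vectors.dense(row.getAs[org.apache.spark.ml.linalg.Vector]("features").toArray)
}
```

With `dat` as an `RDD[org.apache.spark.mllib.linalg.Vector]`, the original `KMeans.train(dat, numCluster, numIteration)` call should type-check.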


1 Answer


You imported the incorrect library: you should use KMeans from ml instead of mllib. The ml version works directly on a DataFrame, while the mllib version requires an RDD[Vector], which is why you get the type mismatch.
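A minimal sketch of the ml-based approach, assuming kmeanInput is the DataFrame built above and numCluster and numIteration are already defined:

```scala
import org.apache.spark.ml.clustering.KMeans

// ml's KMeans is an Estimator: configure it, then fit it on the DataFrame
// directly -- no RDD conversion needed.
val kmeans = new KMeans()
  .setK(numCluster)
  .setMaxIter(numIteration)
  .setFeaturesCol("features")

val model = kmeans.fit(kmeanInput)

// Inspect the learned cluster centers
model.clusterCenters.foreach(println)
```

Since the VectorAssembler output already lives in a DataFrame column named "features" (the ml default), this avoids the Row-to-Vector conversion entirely.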