4 votes

I'm trying to run Spark (1.3.1) MLlib k-means clustering on a dataframe of floating point numbers. I'm following the clustering example provided by Spark:

https://spark.apache.org/docs/1.3.1/mllib-clustering.html

However, instead of a text file, I'm using a dataframe composed of one column of doubles (for simplicity). I need to convert this to a vector for the KMeans function, as per the MLlib docs. So far I have this

    import org.apache.spark.mllib.linalg.Vectors
    val parsedData = data.map(s => Vectors.dense(s(0))).cache()

and I receive the error:

    error: overloaded method value dense with alternatives:
    (values: Array[Double])org.apache.spark.mllib.linalg.Vector and
    (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
    cannot be applied to (Any)
    val parsedData = sample2.map(s => Vectors.dense(s(1))).cache()
                                              ^
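The failure can be reproduced without Spark at all: indexing into a heterogeneous collection yields `Any`, which matches neither overload of a `dense`-style method. This is a minimal sketch with made-up stand-in overloads, not Spark's real API:

```scala
// Stand-in overloads shaped like Vectors.dense (illustrative only).
def dense(values: Array[Double]): String = "array"
def dense(first: Double, rest: Double*): String = "varargs"

// Like Row#apply, indexing a Seq[Any] gives a value of static type Any.
val s: Seq[Any] = Seq(1.0)
// dense(s(0))  // does not compile: Any matches neither overload
val ok = dense(s(0).asInstanceOf[Double]) // an explicit cast resolves it
```

So the error is about static types, not the runtime value: the element really is a `Double`, but the compiler only knows it as `Any`.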

Is there a better way of doing this?

I have read these similar posts, but didn't find them similar enough: How to turn a known structured RDD to Vector, and How to convert org.apache.spark.rdd.RDD[Array[Double]] to Array[Double] which is required by Spark MLlib, which deals with text data.

So this question exists in the context of Spark running on a Hadoop cluster, where a table was queried using Hive; hence the dataframe. I suspect this will become an increasingly common scenario as more organisations move to Hadoop. – Michael Plazzer

2 Answers

1 vote

What about:

val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(0))).cache()

If data is a dataframe with a single column of doubles, this should work. If you have more columns in your dataframe, just add more gets, like:

val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(0),s.getDouble(1))).cache()

1 vote

Since org.apache.spark.sql.Row can store values of any type, its apply method has the following signature:

    def apply(i: Int): Any

while Vectors.dense expects Double arguments. There are a couple of ways to handle this. Let's assume you want to extract values from column x. First of all, you can pattern match over the Row constructor:

data.select($"x").map {
    case  Row(x: Double) => Vectors.dense(x)
}

If you prefer a positional approach, you can pattern match over the value returned from apply:

data.select($"x").map(row => row(0) match {
    case x: Double => Vectors.dense(x)
})

Finally, you can use the getDouble method:

data.select($"x").map(r => Vectors.dense(r.getDouble(0)))

The select part is optional, but it makes it easier to pattern match over a Row and protects you from some naive mistakes, like passing a wrong index to get.

If you ever want to extract a larger number of columns, going one by one can be cumbersome. In that case something like this could be useful:

data.select($"x", $"y", $"z").map(r => Vectors.dense(
    r.toSeq.map({ case col: Double => col }).toArray)
)
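One caveat: the partial function above throws a MatchError at runtime if any column is not already a Double. If your table mixes numeric types (say, Hive integers alongside doubles), a small standalone helper could widen them first. This is a sketch with a hypothetical helper name, not a Spark API:

```scala
// Hypothetical helper: widen any numeric value pulled out of a Row to
// Double before building a dense vector; fail fast on anything else.
def toDoubleValue(v: Any): Double = v match {
  case d: Double => d
  case f: Float  => f.toDouble
  case l: Long   => l.toDouble
  case i: Int    => i.toDouble
  case other     => sys.error(s"not a numeric value: $other")
}

// Mixed numeric types all end up as Double.
val widened = Seq[Any](1, 2.5f, 3L, 4.0).map(toDoubleValue)
```

With that in place, the last snippet's `r.toSeq.map({ case col: Double => col })` could become `r.toSeq.map(toDoubleValue)`.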