2 votes

I'm using Spark MLlib (v1.1.0) with Scala to do k-means clustering on a file of points (longitude and latitude). Each line of my file contains 4 comma-separated fields; the last two are the longitude and latitude.

Here is an example of k-means clustering using Spark: https://spark.apache.org/docs/1.1.0/mllib-clustering.html

What I want to do is read the last two fields of the files in a specific directory in HDFS, transform them into an RDD<Vector>, and use this method of the KMeans class: train(RDD<Vector> data, int k, int maxIterations)

This is my code:

val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))

But when I run it in spark-shell I get the following error:

error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector
  (firstValue: Double, otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (Array[(Double, Double)])

So I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into an RDD<Vector>; any suggestions?


3 Answers

1 vote

There are two 'factory' methods for dense Vectors:

def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
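For illustration (the values are arbitrary), the two overloads are invoked like this:

import org.apache.spark.mllib.linalg.Vectors

Vectors.dense(Array(1.0, 2.0))   // dense(values: Array[Double])
Vectors.dense(1.0, 2.0)          // dense(firstValue: Double, otherValues: Double*)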

The type produced above, however, is Array[Tuple2[Double, Double]] and hence does not type-match. Extracting the parsing logic from the question:

val parseLineToTuple: String => Array[(Double, Double)] = s => s.split(',').map(fields => (fields(2).toDouble, fields(3).toDouble))

What is needed here is to create a new Array[Double] out of the input String, like this (again focusing only on the specific parsing logic):

val parseLineToArray: String => Array[Double] = s => s.split(",").flatMap(fields => Array(fields(2).toDouble, fields(3).toDouble))

Integrating that in the original code should solve the issue:

val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s)))

(You can of course inline that code; I separated it here to focus on the issue at hand.)
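Once vectors is an RDD[Vector], it can be passed to KMeans.train as in the question; a minimal sketch (k = 2 and maxIterations = 20 are placeholder values, not recommendations):

import org.apache.spark.mllib.clustering.KMeans

// placeholder values for k and maxIterations, chosen only for illustration
val model = KMeans.train(vectors, 2, 20)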

1 vote
val parsedData = data.map(s => Vectors.dense(s.split(',').flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble))))
1 vote

The previous suggestion using flatMap was based on the assumption that you wanted to map over the elements of the array given by .split(",") - and it only addressed the type mismatch, by using Array instead of Tuple2.

The argument received by the .map/.flatMap functions is an element of the original collection, so it should be named 'field' (singular) for clarity. Calling fields(2) selects the 3rd character of each element of the split - hence the source of confusion.

If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:

s.split(",").drop(2).take(2).map(_.toDouble)

or, if you want all BUT the first two fields converted to Double (in case there may be more than 2):

s.split(",").drop(2).map(_.toDouble)