I'm new to Spark and Scala and I'm trying to read its documentation on MLlib.

The tutorial on http://spark.apache.org/docs/1.4.0/mllib-data-types.html,

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

does not show how to construct an RDD[Vector] (variable rows) from a list of local vectors.

So for example, I have executed (as part of my exploration) in spark-shell

val v0: Vector = Vectors.dense(1.0, 0.0, 3.0)
val v1: Vector = Vectors.sparse(3, Array(1), Array(2.5))
val v2: Vector = Vectors.sparse(3, Seq((0, 1.5),(1, 1.8)))

which, if 'merged', would look like this matrix:

1.0  0.0  3.0
0.0  2.5  0.0
1.5  1.8  0.0

So, how do I transform Vectors v0, v1, v2 to rows?

Comment (1 vote, from zero323): val rows = sc.parallelize(Seq(v0, v1, v2))

1 Answer

Use the SparkContext's parallelize method, which turns a local Seq into an RDD. Since you have already created the vectors, all you need to do is put them in a Seq and parallelize it, as shown below.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val v0 = Vectors.dense(1.0, 0.0, 3.0)
val v1 = Vectors.sparse(3, Array(1), Array(2.5))
val v2 = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))

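// sc is the SparkContext that spark-shell provides.
// parallelize distributes the local Seq, giving an RDD[Vector].
val rows = sc.parallelize(Seq(v0, v1, v2))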

val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()
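
To check that the rows ended up in the matrix as intended, one option is to pull them back to the driver and print them densely for comparison with the matrix in the question. This is only a minimal sketch: it assumes a spark-shell session where `sc` is already defined, and the name `localRows` is made up for illustration. Note that a RowMatrix has no meaningful row indices, so the printed order should not be relied on in general.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Hypothetical name: the local vectors collected into one Seq.
val localRows: Seq[Vector] = Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.sparse(3, Array(1), Array(2.5)),
  Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
)

// The explicit annotation shows this is the RDD[Vector] the tutorial expects.
val rows: RDD[Vector] = sc.parallelize(localRows)
val mat = new RowMatrix(rows)

// Collect the rows on the driver and print each one as a dense row.
// Only sensible for small matrices: collect() loads everything into driver memory.
mat.rows.collect().foreach(v => println(v.toArray.mkString("  ")))
```

The sparse vectors are expanded by `toArray`, so the entries that were never set show up as 0.0.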