MLlib to Breeze vectors/matrices are private to org.apache.spark.mllib scope?

7

votes

I have read somewhere that MLlib local vectors/matrices currently wrap the Breeze implementation, but the methods that convert MLlib vectors/matrices to Breeze ones are private to the org.apache.spark.mllib scope. The suggested workaround is to write your code in an org.apache.spark.mllib.something package.

Is there a better way to do this? Can you cite some relevant examples?

Thanks and regards,

apache-spark apache-spark-mllib scala-breeze

4

votes

I used the same solution that @dlwh suggested. Here is the code that does it:

package org.apache.spark.mllib.linalg

object VectorPub {

  implicit class VectorPublications(val vector : Vector) extends AnyVal {
    def toBreeze : breeze.linalg.Vector[scala.Double] = vector.toBreeze

  }

  implicit class BreezeVectorPublications(val breezeVector : breeze.linalg.Vector[Double]) extends AnyVal {
    def fromBreeze : Vector = Vectors.fromBreeze(breezeVector)
  }
}

Notice that the implicit classes extend AnyVal to prevent allocation of a new object when calling those methods.

3

votes

My solution is a hybrid of those of @barclar and @lev, above. You don't need to put your code in the org.apache.spark.mllib.linalg package if you don't make use of the spark-ml implicit conversions. You can define your own implicit conversions in your own package, like:

package your.package

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

object BreezeConverters
{
    implicit def toBreeze( dv: DenseVector ): BDV[Double] =
        new BDV[Double](dv.values)

    implicit def toBreeze( sv: SparseVector ): BSV[Double] =
        new BSV[Double](sv.indices, sv.values, sv.size)

    implicit def toBreeze( v: Vector ): BV[Double] =
        v match {
            case dv: DenseVector => toBreeze(dv)
            case sv: SparseVector => toBreeze(sv)
        }

    implicit def fromBreeze( dv: BDV[Double] ): DenseVector =
        new DenseVector(dv.toArray)

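    // Breeze may over-allocate index/data, so keep only the first `used` entries
    implicit def fromBreeze( sv: BSV[Double] ): SparseVector =
        new SparseVector(sv.length, sv.index.take(sv.used), sv.data.take(sv.used))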

    implicit def fromBreeze( bv: BV[Double] ): Vector =
        bv match {
            case dv: BDV[Double] => fromBreeze(dv)
            case sv: BSV[Double] => fromBreeze(sv)
        }
}

Then you can import these implicits into your code with:

import your.package.BreezeConverters._

2

votes

As I understand it, the Spark people do not want to expose third-party APIs (including Breeze), so that it's easier to change things if they decide to move away from them.

You could always put just a simple implicit conversion class in that package and write the rest of your code in your own package. Not much better than just putting everything in there, but it makes it a little more obvious why you're doing it.

1

votes

Here is the best I have so far. Note to @dlwh: please do provide any improvements you might have to this.

The solution I could come up with that does not put code inside the mllib.linalg package is to convert each Vector to a new Breeze DenseVector.

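import breeze.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
// toArray yields a dense array, so sparse inputs would be densified here
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)
val vectout = Vectors.dense((bv1 + bv2).toArray)
// vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]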

1

votes

This solution avoids putting code into Spark's packages and avoids converting sparse to dense vectors:

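// Assuming the mllib (not ml) vector types, as in the question
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def toBreeze(vector: Vector): breeze.linalg.Vector[Double] = vector match {
  case sv: SparseVector => new breeze.linalg.SparseVector[Double](sv.indices, sv.values, sv.size)
  case dv: DenseVector => new breeze.linalg.DenseVector[Double](dv.values)
}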

0

votes

This is a method I wrote to convert an MLlib DenseMatrix to a Breeze matrix; maybe it helps.

import org.apache.spark.mllib.linalg.Matrix

def toBreeze(X: Matrix): breeze.linalg.DenseMatrix[Double] = {
  val m = breeze.linalg.DenseMatrix.zeros[Double](X.numRows, X.numCols)
  // Copy element by element into the new Breeze matrix
  for (i <- 0 until X.numRows; j <- 0 until X.numCols) {
    m(i, j) = X(i, j)
  }
  m
}
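
A shorter variant of the loop above, offered as a sketch rather than a tested drop-in: MLlib's Matrix.toArray and Breeze's DenseMatrix both use column-major order, so the array can be passed straight to the Breeze constructor. The name toBreezeDense is made up for this example. Whether toArray returns a fresh copy or the matrix's own backing array may depend on the Spark version, so treat the result as possibly sharing storage with the input.

```scala
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Hypothetical helper: wraps the column-major array of an MLlib Matrix
// in a Breeze DenseMatrix with the same dimensions.
def toBreezeDense(m: Matrix): BDM[Double] =
  new BDM[Double](m.numRows, m.numCols, m.toArray)

// Usage: a 2x3 matrix given in column-major order
val mat = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val bmat = toBreezeDense(mat)
```

If the input must stay untouched, copying the array first (m.toArray.clone) should be enough to avoid any aliasing.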
