I have a features column which is packaged into a Vector of vectors using Spark's VectorAssembler, as follows (data is the input DataFrame, of type spark.sql.DataFrame):
val featureCols = Array("feature_1","feature_2","feature_3")
val featureAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataWithFeatures = featureAssembler.transform(data)
I am developing a custom Classifier using the Classifier and ClassificationModel developer API. ClassificationModel requires implementing a predictRaw() function, which outputs a vector of raw predictions from the model:
def predictRaw(features: FeaturesType) : Vector
This signature is set by the API: it takes a parameter features of type FeaturesType and outputs a Vector (which in my case I'm taking to be a Spark DenseVector, since DenseVector extends the Vector trait).
Due to the packaging by VectorAssembler, the features column is of type Vector, and each element is itself a vector of the original features for each training sample. For example:
features column - of type Vector
[1.0, 2.0, 3.0] - element1, itself a vector
[3.5, 4.5, 5.5] - element2, itself a vector
I need to extract these features into an Array[Double] in order to implement my predictRaw() logic. Ideally I would like the following result, in order to preserve the cardinality:
`val result: Array[Double] = Array(1.0, 3.5, 2.0, 4.5, 3.0, 5.5)`
i.e. in column-major order, as I will turn this into a matrix.
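To make the ordering I'm after concrete, here is a plain-Scala sketch (no Spark types involved), assuming each row's features were already available as an Array[Double]:

```scala
// Each inner array is one row's features (element1 and element2 above).
val rows: Array[Array[Double]] = Array(
  Array(1.0, 2.0, 3.0), // element1
  Array(3.5, 4.5, 5.5)  // element2
)

// Transposing groups values by feature, and flattening then yields
// the column-major Array[Double] I want.
val result: Array[Double] = rows.transpose.flatten
// result: Array(1.0, 3.5, 2.0, 4.5, 3.0, 5.5)
```
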
I've tried:
val array = features.toArray // this gives an array of vectors and doesn't work
I've also tried to pass the features in as a DataFrame rather than a Vector, but the API expects a Vector because of how VectorAssembler packages the features. For example, the following function works in principle, but doesn't conform to the API, since it takes a DataFrame where FeaturesType is expected to be Vector:
def predictRaw(features: DataFrame): DenseVector = {
  // collect one Array[Double] per row, then transpose + flatten
  // so the result is in column-major order
  val featuresArray: Array[Double] =
    features.rdd.map(r => r.getAs[Vector](0).toArray).collect.transpose.flatten
  // rest of logic would go here
}
My problem is that features is of type Vector, not DataFrame. The other option might be to package features as a DataFrame, but I don't know how to do that without using VectorAssembler.
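To show what I mean by "packaging" the features myself, here is a plain-Scala mock of the step, with each Map standing in for a Row with named columns (illustrative only, not Spark's Row API):

```scala
val featureCols = Array("feature_1", "feature_2", "feature_3")

// Stand-in for the rows of the input DataFrame.
val rows: Seq[Map[String, Double]] = Seq(
  Map("feature_1" -> 1.0, "feature_2" -> 2.0, "feature_3" -> 3.0),
  Map("feature_1" -> 3.5, "feature_2" -> 4.5, "feature_3" -> 5.5)
)

// One Array[Double] per row, in featureCols order -- the "packaging"
// that VectorAssembler would otherwise do for me.
val packed: Seq[Array[Double]] = rows.map(r => featureCols.map(r))
```
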
All suggestions appreciated, thanks! I have looked at Access element of a vector in a Spark DataFrame (Logistic Regression probability vector), but that is in Python and I'm using Scala.