VectorAssembler output only to DenseVector?

Question

There is something very annoying about how VectorAssembler works. I am currently transforming a set of columns into a single column of vectors and then using StandardScaler to scale the included features. However, it seems that Spark, for memory reasons, decides whether to use a DenseVector or a SparseVector to represent each row of features. But when you need to use StandardScaler, SparseVector input is invalid; only DenseVectors are allowed. Does anybody know a solution to that?

Edit: I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.

max max · Accepted Answer · 2016-07-26T17:28:02

You're right that VectorAssembler chooses dense vs sparse output format based on whichever one uses less memory.
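To make the memory trade-off concrete, here is a rough sketch in plain Python. It is not Spark's actual code: the helper names are hypothetical, and the byte counts assume 8 bytes per double value and 4 bytes per integer index while ignoring fixed per-object overhead, so the exact cutoff Spark applies may differ.

```python
def estimated_bytes(size, nnz):
    """Return (dense_bytes, sparse_bytes) under the stated assumptions."""
    dense = 8 * size    # one 8-byte double for every slot, zero or not
    sparse = 12 * nnz   # a 4-byte index plus an 8-byte double per non-zero
    return dense, sparse


def smaller_format(size, nnz):
    """Name the representation with the smaller estimated footprint."""
    dense, sparse = estimated_bytes(size, nnz)
    return "sparse" if sparse < dense else "dense"


print(smaller_format(4, 2))  # 24 bytes sparse vs 32 bytes dense
print(smaller_format(4, 3))  # 36 bytes sparse vs 32 bytes dense
```

Under these assumptions a row is stored sparsely only when well under two thirds of its entries are non-zero, which is why rows in the same column can end up with different vector types.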

You don't need a UDF to convert from SparseVector to DenseVector; just use the toArray() method:

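from pyspark.ml.linalg import SparseVector, DenseVector

# Sparse vector of size 4 with non-zero entries at indices 1 and 3
a = SparseVector(4, [1, 3], [3.0, 4.0])
# toArray() returns a NumPy array, which DenseVector accepts directly
b = DenseVector(a.toArray())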

Also, StandardScaler accepts a SparseVector unless you set withMean=True at creation. If you do need to de-mean, you have to subtract a (presumably non-zero) number from every component, so the sparse vector won't be sparse any more.
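The effect of de-meaning on sparsity can be seen with a small plain-Python sketch. It does not use Spark; the three-row feature column is made-up illustration data, and the sample standard deviation stands in for the scaling step.

```python
import statistics

# One feature across three rows; most entries are zero (illustrative data).
column = [0.0, 0.0, 6.0]

mean = sum(column) / len(column)
std = statistics.stdev(column)  # sample standard deviation

# Scaling only (no centering): zeros stay zero, so sparsity is preserved.
scaled = [x / std for x in column]

# Centering: every former zero becomes -mean, so sparsity is lost.
centered = [x - mean for x in column]


def count_nonzeros(values):
    return sum(1 for v in values if v != 0.0)


nonzeros_before = count_nonzeros(column)
nonzeros_scaled = count_nonzeros(scaled)
nonzeros_centered = count_nonzeros(centered)
print(nonzeros_before, nonzeros_scaled, nonzeros_centered)
```

Dividing by the standard deviation leaves the zero entries untouched, while subtracting the mean turns each of them into a non-zero value, so a centered column has no sparse representation worth keeping.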
