I have an RDD in Spark where the objects are based on a case class:
ExampleCaseClass(user: User, stuff: Stuff)
I want to use Spark's ML pipeline, so I convert this to a Spark data frame. As part of the pipeline, I want to transform one of the columns into a column whose entries are vectors. Since I want the length of that vector to vary with the model, it should be built into the pipeline as part of the feature transformation.
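For reference, the conversion I'm doing looks roughly like this (a minimal sketch with illustrative field names; my real User/Stuff classes have more fields, and sc is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

// Illustrative stand-ins for the real case classes
case class User(name: String)
case class Stuff(score: Double)
case class ExampleCaseClass(user: User, stuff: Stuff)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(
  ExampleCaseClass(User("a"), Stuff(1.0)),
  ExampleCaseClass(User("b"), Stuff(2.0))))

val df = rdd.toDF() // schema is inferred from the case class via reflection
```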
So I attempted to define a Transformer as follows:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{IntParam, ParamMap}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

class MyTransformer extends Transformer {
  val uid = ""

  val num: IntParam = new IntParam(this, "", "")
  def setNum(value: Int): this.type = set(num, value)
  setDefault(num -> 50)

  def transform(df: DataFrame): DataFrame = {
    ...
  }

  def transformSchema(schema: StructType): StructType = {
    val inputFields = schema.fields
    StructType(inputFields :+ StructField("colName", ???, true))
  }

  def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
How do I specify the DataType of the resulting field (i.e. fill in the ???)? It will be a Vector of some simple type (Boolean, Int, Double, etc.). It seems VectorUDT might have worked, but that's private to Spark. Since any RDD can be converted to a DataFrame, it seems any case class should be convertible to a custom DataType; however, I can't figure out how to do this conversion manually, otherwise I could apply it to a simple case class wrapping the vector.
Furthermore, if I specify a vector type for the column, will VectorAssembler correctly process the vector into separate features when I go to fit the model?
I'm still new to Spark and especially to the ML Pipeline, so I'd appreciate any advice.
Comments:

1) Vectors in Spark support only the Double type (these are not the same as Scala's Vector, see stackoverflow.com/q/31255756/1560062). 2) VectorUDT is not private. 3) Not every RDD can be converted to a DataFrame, or at least not directly, and not every case class can be automatically used as a Dataset (DataFrame) element. – zero323

RDDs can be converted to DataFrames. Dirty RDD – Alberto Bonsanto