My original schema contains a lot of MapType columns that I want to use in an ML model, so I need to convert them into SparkML sparse vectors.
root
|-- colA: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
|-- colB: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- colC: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Context: SparkML models require the data to be provided as a feature vector. There are some utilities to generate the feature vector, but none of them supports MapType. For example, the SparkML VectorAssembler only allows combining several columns of numeric, boolean, or vector type.
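For illustration, a minimal sketch of that limitation (not from my actual code; the column name comes from the schema above, and the variable name is a placeholder):

import org.apache.spark.ml.feature.VectorAssembler

// VectorAssembler validates input column types; a MapType column such as colA
// is rejected with an unsupported-data-type error.
val naive = new VectorAssembler()
  .setInputCols(Array("colA")) // colA is MapType -> not supported
  .setOutputCol("features")
// naive.transform(df) would fail here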
Edit:
So far my solution is to explode each map into individual columns and then use the VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.explode
import spark.implicits._

// One pass over the data to collect the distinct map keys; the keys are
// strings per the schema, so they are read with .as[String] and sorted.
val listKeysColA = df.select(explode($"colA")).select($"key").as[String].distinct.collect.sorted
// One column per key; rows missing that key get 0 via na.fill.
val exploded = df.select(listKeysColA.map(x => $"colA".getItem(x).alias(x)): _*).na.fill(0)
val columnNames = exploded.columns
val assembler = new VectorAssembler().setInputCols(columnNames).setOutputCol("features")
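For completeness, producing the feature column from here looks like this (a sketch; exploded is the DataFrame built above):

// Appends the assembled "features" vector column to the exploded DataFrame.
val assembled = assembler.transform(exploded)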
Edit2:
I should add that the data in my maps is very sparse and there is no known set of keys beforehand. That is why my current solution first does one pass over the data to collect and sort the keys, and then accesses the values using getItem(keyName).
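A sketch of a direction that avoids materializing one column per key: reuse the keys collected above to build an index, then construct the sparse vector directly with a UDF. The names keyIndex, toSparse, and featuresA are placeholders, and I assume the map values contain no nulls:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Map each known key to a fixed vector index (keys were collected and sorted above).
val keyIndex: Map[String, Int] = listKeysColA.zipWithIndex.toMap

// Build the sparse vector straight from the map, ignoring unknown keys.
val toSparse = udf { m: Map[String, Double] =>
  if (m == null) Vectors.sparse(keyIndex.size, Seq.empty[(Int, Double)])
  else {
    val entries = m.toSeq.flatMap { case (k, v) => keyIndex.get(k).map(i => (i, v)) }
    Vectors.sparse(keyIndex.size, entries) // entries are sorted by index internally
  }
}

val withFeatures = df.withColumn("featuresA", toSparse($"colA"))

This keeps the single pass over the data for key collection but skips the wide select and na.fill, which should matter when the key space is large and the maps are sparse.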