I'm trying to convert a csv-string that actually contains double values into a spark-ml compatible dataset. Since I don't know the number of features to be expected beforehand, I decided to use a helper class "Instance", that already contains the right datatypes to be used by the classifiers and that is working as intended in some other cases already:
public class Instance implements Serializable {
/**
*
*/
private static final long serialVersionUID = 6091606543088855593L;
private Vector indexedFeatures;
private double indexedLabel;
...getters and setters for both fields...
}
The part, where I get the unexpected behaviour is this one:
Encoder<Instance> encoder = Encoders.bean(Instance.class);
System.out.println("encoder.schema()");
encoder.schema().printTreeString();
Dataset<Instance> dfInstance = df.select("value").as(Encoders.STRING())
.flatMap(s -> {
String[] splitted = s.split(",");
int length = splitted.length;
double[] features = new double[length-1];
for (int i=0; i<length-1; i++) {
features[i] = Double.parseDouble(splitted[i]);
}
if (length < 2) {
return Collections.emptyIterator();
} else {
return Collections.singleton(new Instance(
Vectors.dense(features),
Double.parseDouble(splitted[length-1])
)).iterator();
}
}, encoder);
System.out.println("dfInstance");
dfInstance.printSchema();
dfInstance.show(5);
And I get the following output on the console:
encoder.schema()
root
|-- indexedFeatures: vector (nullable = true)
|-- indexedLabel: double (nullable = false)
dfInstance
root
|-- indexedFeatures: struct (nullable = true)
|-- indexedLabel: double (nullable = true)
+---------------+------------+
|indexedFeatures|indexedLabel|
+---------------+------------+
| []| 0.0|
| []| 0.0|
| []| 1.0|
| []| 0.0|
| []| 1.0|
+---------------+------------+
only showing top 5 rows
The encoder schema is correctly displaying the indexedFeatures row datatype to be a vector. But when I apply the encoder and do the transformation, it will give me a row of type struct, containing no real objects.
I would like to understand, why Spark is providing me with a struct type instead of the correct vector one.