I have a Dataset<Row> df that contains two columns, "key" and "value", both of type string. df.printSchema() gives me the following output:
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
The content of the value column is actually a CSV-formatted line (coming from a Kafka topic), with the last entry of each line representing the class label and all the previous entries being the features (the header row is not included in the dataset):
feature0,feature1,label
0.6720004294237854,-0.4033586564886893,0
0.6659082469383558,0.07688976580256132,0
0.8086502311695247,0.564354801275521,1
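As a first step, one way to get typed columns out of the value string is to split it on the comma and cast the pieces to double. A minimal sketch, assuming the fixed two-feature layout shown above (the column names parts, feature0, feature1 and indexedLabel are mine):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

// Split the CSV line in "value" into an array column, then pull out
// the individual entries and cast them to double.
Dataset<Row> parsed = df
    .withColumn("parts", split(col("value"), ","))
    .withColumn("feature0", col("parts").getItem(0).cast("double"))
    .withColumn("feature1", col("parts").getItem(1).cast("double"))
    .withColumn("indexedLabel", col("parts").getItem(2).cast("double"))
    .drop("parts");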
Since I would like to train a classifier on this data, I need to transform this representation into a column of type dense vector, containing all the feature values, and a column of type double, containing the label value:
root
|-- indexedFeatures: vector (nullable = false)
|-- indexedLabel: double (nullable = false)
How can I do this using Java 1.8 and Spark 2.2.0?
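Building on the split-and-cast step above, one candidate is Spark ML's VectorAssembler, which packs double columns into a single vector column. A sketch, reusing the assumed column names from the snippet above (the nullable flags may not match the target schema exactly):

import org.apache.spark.ml.feature.VectorAssembler;

// Combine the feature columns into one vector column and keep the label.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"feature0", "feature1"})
    .setOutputCol("indexedFeatures");

Dataset<Row> result = assembler
    .transform(parsed)
    .select("indexedFeatures", "indexedLabel");

result.printSchema(); // should show a vector column and a double column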
Edit: I got further, but while attempting to make it work with a flexible number of feature dimensions, I got stuck again. I created a follow-up question.
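For the flexible-dimension case mentioned in this edit, a rough sketch under the assumption that the number of features is known at runtime (numFeatures is a placeholder I introduced; it would have to be determined elsewhere, e.g. from the header line of the topic):

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

int numFeatures = 2; // assumption: known at runtime
String[] featureCols = new String[numFeatures];

// Build one double column per feature, collecting the names for the assembler.
Dataset<Row> tmp = df.withColumn("parts", split(col("value"), ","));
for (int i = 0; i < numFeatures; i++) {
    featureCols[i] = "feature" + i;
    tmp = tmp.withColumn(featureCols[i], col("parts").getItem(i).cast("double"));
}
// The last entry of the line is the label.
tmp = tmp
    .withColumn("indexedLabel", col("parts").getItem(numFeatures).cast("double"))
    .drop("parts");

Dataset<Row> flexible = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("indexedFeatures")
    .transform(tmp)
    .select("indexedFeatures", "indexedLabel");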
Comment from ernest_k: Can you parse value as individual columns? From there, Spark has an assembler that you can use to create a feature vector. Look at spark.read.csv.
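For reference, the spark.read.csv route hinted at in the comment could look roughly like this; Spark 2.2 added a csv(Dataset<String>) overload that parses a column of CSV strings. Here spark is an assumed SparkSession handle, and the schema mirrors the sample data above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Describe the expected layout of each CSV line.
StructType schema = new StructType()
    .add("feature0", DataTypes.DoubleType)
    .add("feature1", DataTypes.DoubleType)
    .add("label", DataTypes.DoubleType);

// Treat the "value" strings as CSV input and let the reader do the typing.
Dataset<String> lines = df.select("value").as(Encoders.STRING());
Dataset<Row> columns = spark.read().schema(schema).csv(lines);

The resulting typed columns could then be fed into the VectorAssembler shown earlier.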