I'm working on a classification model and I'm having trouble getting the data into the correct form for the model. My dataset has 3 columns with sums, which I discretized with the built-in Bucketizer. The remaining columns are categorical with String values, and I used StringIndexer to transform them. Afterwards I select the best columns via ChiSqSelector. So far so good. But now I want to transform the categorical features into dummy variables, and I don't know how to do that because I already have the data in the form of LabeledPoints. Is there an easy way or an existing solution to transform the values of a set of vectors into dummy variables? Or does anyone have a suggestion for solving this problem another way?
1 Answer
@zero323 The input for ChiSqSelector has to be an RDD[LabeledPoint]. My data has 25 features. I select the 15 best features, but for simplicity let's say I have the following LabeledPoints:
LabeledPoint(1, [1, 2, 3])
LabeledPoint(0, [2, 1, 3])
LabeledPoint(1, [1, 3, 1])
For example, suppose ChiSqSelector selects only the best (first) feature, so my LabeledPoints are:
LabeledPoint(1, [1])
LabeledPoint(0, [2])
LabeledPoint(1, [1])
How can I encode the features in the feature vector as dummy variables, so that my LabeledPoints become:
LabeledPoint(1, [1, 0])
LabeledPoint(0, [0, 1])
LabeledPoint(1, [1, 0])
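To make the target encoding concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Point` is a hypothetical stand-in for LabeledPoint, and `DummyEncoding`, `oneHot` and `encode` are illustrative names, not part of any library. It assumes the input is non-empty and that every point has the same number of features.

```scala
// Hypothetical stand-in for Spark's LabeledPoint so the sketch runs without Spark.
final case class Point(label: Double, features: Vector[Double])

object DummyEncoding {
  // Turn one categorical value into an indicator vector over the given categories.
  def oneHot(value: Double, categories: Vector[Double]): Vector[Double] =
    categories.map(c => if (c == value) 1.0 else 0.0)

  // Encode every feature column of every point and concatenate the indicator vectors.
  // Assumes `points` is non-empty and all feature vectors have equal length.
  def encode(points: Seq[Point]): Seq[Point] = {
    val numFeatures = points.head.features.length
    // Distinct values seen in each column, sorted so the output order is stable.
    val categories: Vector[Vector[Double]] =
      (0 until numFeatures)
        .map(i => points.map(_.features(i)).distinct.sorted.toVector)
        .toVector
    points.map { p =>
      val encoded = p.features.zipWithIndex.flatMap { case (v, i) =>
        oneHot(v, categories(i))
      }
      p.copy(features = encoded)
    }
  }
}
```

Applied to the three single-feature points above, the values 1 and 2 become the two categories, which gives the feature vectors [1, 0], [0, 1] and [1, 0].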
Hope that helps. Or do you need some code?
Edit: My idea right now is something like this: convert the label and features of each LabeledPoint to a Row, then convert this RDD to a DataFrame in order to use the OneHotEncoder:
import org.apache.spark.sql.Row

// chiData: the RDD[LabeledPoint] produced by ChiSqSelector
val data = chiData.map { r =>
  // Label first, followed by every selected feature value
  Row.fromSeq(r.label +: r.features.toArray.toSeq)
}
// Convert RDD to DataFrame
// Use OneHotEncoder
// Create LabeledPoints again for use in algorithms
But I think this is not the smartest way.
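One possible alternative that avoids the round trip through a DataFrame is to encode directly on the RDD. The sketch below is only an illustration under some assumptions: `toDummies` is a hypothetical helper name, every feature in the vector is treated as categorical, and the number of distinct values per feature is small enough to collect on the driver. It uses only the mllib `LabeledPoint` and `Vectors.dense` API plus standard RDD operations.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: one-hot encode every feature of an RDD[LabeledPoint].
def toDummies(points: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  // Collect the distinct values of each feature index on the driver.
  // Assumes the number of distinct values per feature is small.
  val categories: Map[Int, Array[Double]] = points
    .flatMap(p => p.features.toArray.zipWithIndex.map { case (v, i) => (i, v) })
    .distinct()
    .collect()
    .groupBy(_._1)
    .map { case (i, pairs) => i -> pairs.map(_._2).sorted }

  // Ship the lookup table to the executors once.
  val bc = points.sparkContext.broadcast(categories)

  points.map { p =>
    // Replace each value with its indicator vector and concatenate the results.
    val encoded = p.features.toArray.zipWithIndex.flatMap { case (v, i) =>
      bc.value(i).map(c => if (c == v) 1.0 else 0.0)
    }
    LabeledPoint(p.label, Vectors.dense(encoded))
  }
}
```

Calling `toDummies(chiData)` would then return LabeledPoints ready for the learning algorithm. Note that a value unseen when the categories were collected (for example in a separate test set) would be encoded as all zeros, so the categories should be computed on the data that defines the vocabulary.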