0
votes

I'm working on a classification model and I'm having trouble getting the data into the right form for it. My dataset has 3 columns containing sums, which I discretized with the provided Bucketizer. The remaining columns are categorical with String values, which I transformed with the StringIndexer. Afterwards I select the best columns via ChiSqSelector. So far so good. But now I want to transform the categorical features into dummy variables, and I don't know how to do that because my data is already in the form of LabeledPoints. Is there an easy way or an existing solution to transform the values in a set of vectors into dummy variables? Or does anyone have a suggestion for solving this problem another way?

1
Could you provide a reproducible example? It is not clear how a set of vectors corresponds to LabeledPoints. – zero323

1 Answer

0
votes

@zero323 The input for ChiSqSelector has to be an RDD[LabeledPoint]. My data has 25 features. I select the 15 best features, but for simplicity let's say I have the following LabeledPoints:

LabeledPoint(1, [1, 2, 3])
LabeledPoint(0, [2, 1, 3])
LabeledPoint(1, [1, 3, 1])

For example, if ChiSqSelector selects only the best (first) feature, my LabeledPoints become:

LabeledPoint(1, [1])
LabeledPoint(0, [2])
LabeledPoint(1, [1])

How can I now encode the features in the feature vector as dummy variables, so that my LabeledPoints become:

LabeledPoint(1, [1, 0])
LabeledPoint(0, [0, 1])
LabeledPoint(1, [1, 0])

Hope that helps. Or do you need some code?

Edit: My idea right now is something like this: convert the label and features of each LabeledPoint to a Row, then convert this RDD to a DataFrame in order to use the OneHotEncoder:

val data = chiData.map { r =>
   val label = r.label
   val feature1 = r.features.toArray(0)
   val feature2 = r.features.toArray(1)
   val feature3 = r.features.toArray(2)
   ....
   Row(label, feature1, feature2, feature3, ...)
}
// Convert RDD to DataFrame
// Use OneHotEncoder
// Create LabeledPoints again for use in algorithms

But I think this is not the smartest way.
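
One possible alternative that avoids the DataFrame round trip is to expand each feature value into a 0/1 block directly on the feature arrays. The following is only a minimal sketch in plain Scala, with no Spark dependency so that it runs on its own. The names OneHotSketch, categoriesPerColumn and encode are made up for illustration, and it assumes every feature in the vector is categorical and that the set of distinct values per column is small enough to hold in memory:

```scala
object OneHotSketch {
  // Sorted distinct values observed in each feature column.
  // Assumes `rows` is non-empty and all rows have the same length.
  def categoriesPerColumn(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val numFeatures = rows.head.length
    Array.tabulate(numFeatures)(j => rows.map(_(j)).distinct.sorted.toArray)
  }

  // Expand each categorical value into a 0/1 block with one slot per category.
  // A value that was never observed for its column yields an all-zero block.
  def encode(features: Array[Double], categories: Array[Array[Double]]): Array[Double] =
    features.zip(categories).flatMap { case (value, cats) =>
      cats.map(c => if (c == value) 1.0 else 0.0)
    }

  def main(args: Array[String]): Unit = {
    // Feature vectors from the single-feature example above: [1], [2], [1]
    val rows = Seq(Array(1.0), Array(2.0), Array(1.0))
    val cats = categoriesPerColumn(rows)
    rows.foreach(r => println(encode(r, cats).mkString("[", ", ", "]")))
    // prints [1.0, 0.0], [0.0, 1.0], [1.0, 0.0] on separate lines
  }
}
```

On an RDD[LabeledPoint], the same idea would presumably be to collect the distinct values per column once and then map each point to LabeledPoint(lp.label, Vectors.dense(encode(lp.features.toArray, cats))). Note that this sketch keeps one slot for every category, whereas OneHotEncoder drops the last category by default (dropLast), so the two results are not identical.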