
I'm trying to build a decision tree based on log files. Some feature sets are large, containing thousands of unique values. I'm trying to use the new Pipeline and DataFrame APIs in Java. I've built a pipeline with a StringIndexer stage for each of the categorical feature columns, then a VectorAssembler to create the features vector. The resulting DataFrame looks perfect to me after the VectorAssembler stage. My pipeline looks approximately like:

StringIndexer -> StringIndexer -> StringIndexer -> VectorAssembler -> DecisionTreeClassifier

However, I get the following error:

DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 5 has 49 values. Considering remove this and other categorical features with a large number of values, or add more training examples.

I can resolve this issue by using a Normalizer, but then the resulting decision tree is unusable for my needs, because I need to generate a DSL decision tree with the original feature values. I can't manually set maxBins because the whole pipeline is executed together. I would like the resulting decision tree to use the StringIndexer-generated values (e.g. Feature 5 <= 132). Additionally, but less importantly, I'd like to be able to specify my own names for the features (e.g. 'domain' instead of 'Feature 5').
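For reference, StringIndexer assigns indices by descending frequency, so a split such as Feature 5 <= 132 can be mapped back to the set of original string values via the fitted indexer's labels. A plain-Python sketch of that mapping (this simulates the behavior, it is not the Spark API; the 'domain' data is made up):

```python
from collections import Counter

def fit_string_indexer(values):
    # StringIndexer orders labels by descending frequency (ties broken
    # alphabetically here for determinism); index 0 = most frequent value.
    counts = Counter(values)
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    return labels  # labels[i] is the original string for index i

domain_values = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]
labels = fit_string_indexer(domain_values)
index_of = {label: i for i, label in enumerate(labels)}

print(labels)             # ['a.com', 'b.com', 'c.com']
print(index_of["c.com"])  # 2
# A tree split like "domain <= 1" then covers the original values labels[0:2]:
print(labels[:2])         # ['a.com', 'b.com']
```

In Spark itself the same list is available as `StringIndexerModel.labels`, which is what you would use to translate the numeric thresholds in the tree back into original feature values.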


1 Answer


The DecisionTree algorithm uses a single maxBins value to decide how many splits to consider; the default is 32. maxBins must be greater than or equal to the maximum number of categories among the categorical features. Since your feature 5 has 49 distinct values, you need to increase maxBins to 49 or greater.
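To pick a safe value programmatically, you can compute the largest distinct-value count over the categorical columns and use it as a lower bound for maxBins. A plain-Python sketch (the column data is made up; in Spark you would count distinct values per column on the DataFrame instead):

```python
def required_max_bins(categorical_columns):
    # maxBins must be >= the largest category count among all
    # categorical features (and at least 2).
    return max(2, max(len(set(values))
                      for values in categorical_columns.values()))

columns = {
    "domain": ["a.com", "b.com", "c.com", "a.com"],  # 3 distinct values
    "status": ["200", "404", "500", "301", "200"],   # 4 distinct values
}
print(required_max_bins(columns))  # 4
```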

The DecisionTree algorithm has several hyperparameters, and tuning them to your data can improve accuracy. You can do this tuning with Spark's CrossValidator, which automatically evaluates a grid of hyperparameter combinations and chooses the best one.

Here is an example in Python that tests three maxBins values ([49, 52, 55]), along with maxDepth and impurity:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder

dt = DecisionTreeClassifier(labelCol="indexed")
paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxBins, [49, 52, 55])
             .addGrid(dt.maxDepth, [4, 6, 8])
             .addGrid(dt.impurity, ["entropy", "gini"])
             .build())
pipeline = Pipeline(stages=[labelIndexer, typeIndexer, assembler, dt])
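For intuition, what CrossValidator does with such a grid is fit the pipeline once per parameter combination and keep the combination with the best metric. A plain-Python sketch of that grid search (the scoring function below is a stand-in, not Spark code):

```python
from itertools import product

def grid_search(param_grid, evaluate):
    # param_grid: dict mapping a param name to a list of candidate values.
    # evaluate: callable(params_dict) -> metric, where higher is better.
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"maxBins": [49, 52, 55], "maxDepth": [4, 6, 8],
        "impurity": ["entropy", "gini"]}
# Stand-in metric: pretend more bins and deeper trees score better.
best, score = grid_search(grid, lambda p: p["maxBins"] + p["maxDepth"])
print(best)  # {'maxBins': 55, 'maxDepth': 8, 'impurity': 'entropy'}
```

In Spark you would wire the real pieces together with `CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=..., numFolds=...)` and call `fit` on your training DataFrame.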