I'm trying to build a decision tree from log files. Some features are large, containing thousands of unique values. I'm trying to use the new Pipeline and DataFrame idioms in Java. I've built a pipeline with a StringIndexer stage for each of the categorical feature columns, followed by a VectorAssembler that creates the features vector. The DataFrame that comes out of the VectorAssembler stage looks correct to me. My pipeline looks approximately like this:
StringIndexer -> StringIndexer -> StringIndexer -> VectorAssembler -> DecisionTreeClassifier
However, I get the following error:
DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 5 has 49 values. Considering remove this and other categorical features with a large number of values, or add more training examples.
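To make the constraint in that message concrete, here is a minimal plain-Java sketch of the check it describes: maxBins has to be at least the number of distinct values in every categorical column. This is not Spark code. The class name `MaxBinsCheck`, its methods, and the sample values are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Plain-Java illustration only; none of this is part of the Spark API.
class MaxBinsCheck {

    // Number of distinct values in each categorical column.
    static int[] distinctCounts(List<List<String>> columns) {
        int[] counts = new int[columns.size()];
        for (int i = 0; i < columns.size(); i++) {
            counts[i] = new HashSet<>(columns.get(i)).size();
        }
        return counts;
    }

    // Smallest maxBins that satisfies the constraint quoted in the error:
    // maxBins >= number of values in every categorical feature.
    static int requiredMaxBins(List<List<String>> columns) {
        int required = 0;
        for (int count : distinctCounts(columns)) {
            required = Math.max(required, count);
        }
        return required;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two categorical columns from a log file.
        List<List<String>> columns = Arrays.asList(
            Arrays.asList("GET", "POST", "GET"),
            Arrays.asList("a.example", "b.example", "c.example", "a.example"));
        System.out.println("maxBins must be at least " + requiredMaxBins(columns));
    }
}
```

In my case the default of 32 is too small because feature 5 alone has 49 distinct values, and other columns have thousands.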
I can resolve this issue by using a Normalizer, but then the resulting decision tree is unusable for my needs, because I need to generate a DSL decision tree with the original feature values. I can't manually set maxBins because the whole pipeline is executed together. I would like the resulting decision tree to use the StringIndexer-generated values (e.g. Feature 5 <= 132). Additionally, but less importantly, I'd like to be able to specify my own names for the features (e.g. 'domain' instead of 'Feature 5').
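To show the kind of output I'm after for the feature names, here is a rough plain-Java sketch that rewrites "Feature N" in a textual rule using a list of column names. It is only an illustration of the desired result, not something Spark provides: the class `FeatureRenamer`, the regular expression, and every column name except 'domain' are assumptions, and it assumes the names are listed in the same order the columns were assembled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; none of this is part of the Spark API.
class FeatureRenamer {

    // Matches "Feature 5" or "feature 5" and captures the index.
    private static final Pattern FEATURE = Pattern.compile("(?i)feature (\\d+)");

    // Replaces each "Feature N" with names[N]; unknown indices are left as-is.
    static String rename(String treeText, String[] names) {
        Matcher m = FEATURE.matcher(treeText);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            String replacement = index < names.length ? names[index] : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical column names, in assembler input order.
        String[] names = {"method", "status", "agent", "referrer", "path", "domain"};
        System.out.println(rename("Feature 5 <= 132", names)); // prints "domain <= 132"
    }
}
```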