I had the same error using PySpark. It can occur for several reasons:
1) To make sure maxBins is adequate, set it to the maximum number of distinct values over all your categorical columns:
maxBins = max(categoricalFeaturesInfo.values())
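For illustration, a minimal sketch of how these fit together; the numClasses=2 here is a placeholder, not taken from the original question:
from pyspark.mllib.tree import DecisionTree
# categoricalFeaturesInfo maps feature index -> number of distinct categories,
# e.g. {0: 31, 2: 7}; maxBins must be at least the largest of those counts.
maxBins = max(categoricalFeaturesInfo.values())
model = DecisionTree.trainClassifier(trainingData,
                                     numClasses=2,
                                     categoricalFeaturesInfo=categoricalFeaturesInfo,
                                     maxBins=maxBins)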
2) The error message says
...but categorical feature 0 has 31 values...
Is column 0 (the very first one, not the first feature) of trainingData actually the labels of the training set? It must be: DecisionTree.trainClassifier treats the first column as the labels by default. Make sure the label column is the first one of trainingData and not one of the features.
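A quick sanity check (hypothetical, assuming trainingData is already an RDD of LabeledPoint):
first = trainingData.first()
print(first.label)     # should be the class label
print(first.features)  # should contain only the features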
3) How did you build trainingData? DecisionTree.trainClassifier works for me with the table parsed to LabeledPoint, just as RandomForest.trainClassifier does; see http://jarrettmeyer.com/2017/05/04/random-forests-with-pyspark and the full walkthrough below. (*)
4) Also, before transforming the dataset to a LabeledPoint RDD, index the categorical columns of the original dataframe. What works for me is first transforming the source data frame with a Pipeline, each stage consisting of a StringIndexer transformation that appends a column holding the indexed values of one categorical column, and then converting the result to LabeledPoint.
In summary, the way it works for me in PySpark is as follows:
Suppose the original dataframe is stored in the df variable and the list of names of its categorical features is stored in categoricalFeatures.
Import Pipeline and StringIndexer (*):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
To set up the pipeline stages, create a list of StringIndexer instances, each one indexing one categorical column (*); see https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer
indexers = [ StringIndexer(inputCol=column, outputCol=column + "_indexed") for column in categoricalFeatures ]
Note that outputCol must differ from inputCol, otherwise the transformation fails because the output column already exists; the _indexed suffix is one way to do that.
Be careful here because Spark version 1.6 doesn't implement the handleInvalid="keep" option for StringIndexer instances, so you need to replace the NULL values before running these stages. See https://weishungchung.com/2017/08/14/stringindexer-transform-fails-when-column-contains-nulls/
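A minimal sketch of that replacement; the "missing" placeholder is an arbitrary choice, StringIndexer simply treats it as one more category:
# fillna with a string value only touches string-typed columns
df = df.fillna("missing", subset=categoricalFeatures)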
Set up the pipeline: (*)
pipeline = Pipeline( stages=indexers )
Now run the transformations:
df_r = pipeline.fit(df).transform(df)
If NULL values were present in df, a NullPointerException will be raised here; that is why they must be replaced beforehand, as noted above.
Now all the columns in the categoricalFeatures list are indexed in df_r, each with an _indexed counterpart. Remove the original columns (whose names are the inputCol values) from df_r, since the feature vector may only contain numeric columns.
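A sketch of that cleanup; the column named label is an assumption, use whatever your label column is called:
# drop the original (string) categorical columns, keeping the indexed versions
for column in categoricalFeatures:
    df_r = df_r.drop(column)
# reorder so the label ends up in column 0 for the next step
df_r = df_r.select(['label'] + [c for c in df_r.columns if c != 'label'])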
And finally, declare your trainingData using labeled points: (*)
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
trainingData = df_r.rdd.map(lambda row: LabeledPoint(row[0], Vectors.dense(row[1:])))
Here all columns of df_r must be numeric (so the categorical columns have already been replaced by their indexed versions) and the label column must be column number 0 of df_r. If it is not, let's say the label is column i, change it to:
trainingData = df_r.rdd.map(lambda row: LabeledPoint(row[i], Vectors.dense(row[:i] + row[i+1:])))
Creating trainingData this way works for me.
There is also a fast and easy way to obtain categoricalFeaturesInfo from the df_r metadata: let k be the index of a categorical column transformed with StringIndexer; then
df_r.schema.fields[k].metadata['ml_attr']['vals']
stores the original values. Counting them tells you how many distinct values that column has, and you can also recover the original values from there instead of using IndexToString.
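Putting that together, a sketch that builds the whole dictionary; it assumes, as above, that the label is column 0 of df_r, so MLlib feature j corresponds to dataframe column j + 1:
categoricalFeaturesInfo = {}
for j, field in enumerate(df_r.schema.fields[1:]):
    meta = field.metadata
    # StringIndexer stores the original categories under ml_attr/vals
    if 'ml_attr' in meta and 'vals' in meta['ml_attr']:
        categoricalFeaturesInfo[j] = len(meta['ml_attr']['vals'])
maxBins = max(categoricalFeaturesInfo.values())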
Regards.
(*) With a few changes you can do the same in Scala.