I had the same error using PySpark. It can occur for several reasons:
1) To make sure maxBins is adequate, set it to the maximum number of distinct values over all your categorical columns:
maxBins = max(categoricalFeaturesInfo.values())
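For illustration, a minimal sketch of how these fit together; the numClasses=2 here is a placeholder, not taken from the original question:
from pyspark.mllib.tree import DecisionTree
# categoricalFeaturesInfo maps feature index -> number of distinct categories,
# e.g. {0: 31, 2: 7}; maxBins must be at least the largest of those counts.
maxBins = max(categoricalFeaturesInfo.values())
model = DecisionTree.trainClassifier(trainingData,
                                     numClasses=2,
                                     categoricalFeaturesInfo=categoricalFeaturesInfo,
                                     maxBins=maxBins)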
2) The error message says
...but categorical feature 0 has 31 values...
Is column 0 (the very first one, not the first feature) of trainingData actually the labels of the training set? It must be: DecisionTree.trainClassifier treats the first column as the labels by default. Make sure the label column is the first one of trainingData and not one of the features.
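A quick sanity check (hypothetical, assuming trainingData is already an RDD of LabeledPoint):
first = trainingData.first()
print(first.label)     # should be the class label
print(first.features)  # should contain only the features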
3) How did you build trainingData? DecisionTree.trainClassifier works for me with the table parsed to LabeledPoint, just as RandomForest.trainClassifier does; see http://jarrettmeyer.com/2017/05/04/random-forests-with-pyspark and the full walkthrough below. (*)
4) Also, before transforming the dataset to a LabeledPoint RDD, index the categorical columns of the original dataframe. What works for me is first transforming the source data frame with a Pipeline, each stage consisting of a StringIndexer transformation that appends a column holding the indexed values of one categorical column, and then converting the result to LabeledPoint.
In summary, the way it works for me in PySpark is as follows:
Suppose the original dataframe is stored in the df variable and the list of names of its categorical features is stored in categoricalFeatures.
Import Pipeline and StringIndexer (*):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
To set up the pipeline stages, create a list of StringIndexer instances, each one indexing one categorical column (*); see https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer
indexers = [ StringIndexer(inputCol=column, outputCol=column + "_indexed") for column in categoricalFeatures ]
Note that outputCol must differ from inputCol, otherwise the transformation fails because the output column already exists; the _indexed suffix is one way to do that.
Be careful here because Spark version 1.6 doesn't implement the handleInvalid="keep" option for StringIndexer instances, so you need to replace the NULL values before running these stages. See https://weishungchung.com/2017/08/14/stringindexer-transform-fails-when-column-contains-nulls/
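A minimal sketch of that replacement; the "missing" placeholder is an arbitrary choice, StringIndexer simply treats it as one more category:
# fillna with a string value only touches string-typed columns
df = df.fillna("missing", subset=categoricalFeatures)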
Set up the pipeline: (*)
pipeline = Pipeline( stages=indexers )
Now run the transformations:
df_r = pipeline.fit(df).transform(df)
If NULL values were present in df, a NullPointerException will be raised here; that is why they must be replaced beforehand, as noted above.
Now all the columns in the categoricalFeatures list are indexed in df_r, each with an _indexed counterpart. Remove the original columns (whose names are the inputCol values) from df_r, since the feature vector may only contain numeric columns.
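A sketch of that cleanup; the column named label is an assumption, use whatever your label column is called:
# drop the original (string) categorical columns, keeping the indexed versions
for column in categoricalFeatures:
    df_r = df_r.drop(column)
# reorder so the label ends up in column 0 for the next step
df_r = df_r.select(['label'] + [c for c in df_r.columns if c != 'label'])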
And finally, declare your trainingData using labeled points: (*)
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
trainingData = df_r.rdd.map(lambda row: LabeledPoint(row[0], Vectors.dense(row[1:])))
Here all columns of df_r must be numeric (so the categorical columns have already been replaced by their indexed versions) and the label column must be column number 0 of df_r. If it is not, let's say the label is column i, change it to:
trainingData = df_r.rdd.map(lambda row: LabeledPoint(row[i], Vectors.dense(row[:i] + row[i+1:])))
Creating trainingData this way works for me.
There is also a fast and easy way to obtain categoricalFeaturesInfo from the df_r metadata: let k be the index of a categorical column transformed with StringIndexer; then
df_r.schema.fields[k].metadata['ml_attr']['vals']
stores the original values. Counting them tells you how many distinct values that column has, and you can also recover the original values from there instead of using IndexToString.
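Putting that together, a sketch that builds the whole dictionary; it assumes, as above, that the label is column 0 of df_r, so MLlib feature j corresponds to dataframe column j + 1:
categoricalFeaturesInfo = {}
for j, field in enumerate(df_r.schema.fields[1:]):
    meta = field.metadata
    # StringIndexer stores the original categories under ml_attr/vals
    if 'ml_attr' in meta and 'vals' in meta['ml_attr']:
        categoricalFeaturesInfo[j] = len(meta['ml_attr']['vals'])
maxBins = max(categoricalFeaturesInfo.values())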
Regards.
(*) With a few changes you can do the same in Scala.