
I am using the Random Forest algorithm for classification in Spark MLlib with PySpark. My code is as follows:

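from pyspark.mllib.tree import RandomForest

# Train a 3-class random forest on an RDD of LabeledPoint
model = RandomForest.trainClassifier(trnData, numClasses=3, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)

# Predict on the test features and pair each true label with its prediction
predictions = model.predict(tst_dataRDD.map(lambda x: x.features))
labelsAndPredictions = tst_dataRDD.map(lambda lp: lp.label).zip(predictions)

# Fraction of misclassified test points
testErr = labelsAndPredictions.filter(lambda x: x[0] != x[1]).count() / float(tst_dataRDD.count())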

I got this exception: IllegalArgumentException: GiniAggregator given label -0.0625 but requires label to be non-negative.
How can I solve this problem? Thanks.

Full stack trace, please? Som

2 Answers


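It seems that for Gini impurity in multiclass classification, the labels must be non-negative (>= 0). With trainClassifier they should also be class indices 0, 1, ..., numClasses-1, so a label such as -0.0625 is rejected. Please check whether your training data contains negative or non-integer labels.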
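One way to run that check is sketched below in plain Python, so it can be tried without a cluster. The helper names find_invalid_labels and build_label_index and the sample values are hypothetical, not part of Spark. Remapping only makes sense if the labels really are a small set of discrete classes; if the label is a continuous quantity, this is a regression problem rather than a classification one.

```python
def find_invalid_labels(labels, num_classes):
    """Return the sorted distinct labels that are not class indices 0..num_classes-1."""
    bad = set()
    for y in labels:
        # A valid label is a non-negative whole number below num_classes.
        if y < 0 or y >= num_classes or float(y) != int(y):
            bad.add(y)
    return sorted(bad)


def build_label_index(labels):
    """Map each distinct raw label to a class index 0..k-1, ordered by raw value."""
    return {raw: float(i) for i, raw in enumerate(sorted(set(labels)))}


# Made-up sample values for illustration only.
raw_labels = [-0.0625, 0.0, 0.0625, -0.0625, 0.0625]

print(find_invalid_labels(raw_labels, num_classes=3))  # [-0.0625, 0.0625]

label_index = build_label_index(raw_labels)
print(label_index)  # {-0.0625: 0.0, 0.0: 1.0, 0.0625: 2.0}
```

On the RDD from the question, the distinct labels could be collected with trnData.map(lambda lp: lp.label).distinct().collect(), and the mapping applied with trnData.map(lambda lp: LabeledPoint(label_index[lp.label], lp.features)) before calling trainClassifier. The same mapping has to be applied to the test set.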

ref - spark repo

Also, as a side note, prefer the algorithms in the ml package over the legacy mllib package.
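A rough sketch of the ml-package equivalent follows, assuming the data is available as a DataFrame with a "label" column and a "features" vector column (those column names, the "indexedLabel" output column, and the function name build_rf_pipeline are assumptions). StringIndexer maps the raw label values to indices 0, 1, ..., which avoids the negative-label error when the labels are genuinely discrete classes. The hyperparameters mirror the ones in the question.

```python
def build_rf_pipeline(label_col="label", features_col="features"):
    """Build a label-indexing + random forest pipeline (needs an active SparkSession)."""
    # Imported here so the sketch can be loaded without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer

    # Convert arbitrary label values into class indices 0..k-1.
    indexer = StringIndexer(inputCol=label_col, outputCol="indexedLabel")

    # Same settings as the mllib call in the question.
    rf = RandomForestClassifier(
        labelCol="indexedLabel",
        featuresCol=features_col,
        numTrees=3,
        maxDepth=4,
        maxBins=32,
        impurity="gini",
        featureSubsetStrategy="auto",
    )
    return Pipeline(stages=[indexer, rf])
```

With hypothetical DataFrames train_df and test_df, usage would look like model = build_rf_pipeline().fit(train_df) followed by model.transform(test_df), which adds a "prediction" column holding the indexed class.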