Deep decision tree in PySpark

Question

I am using PySpark for machine learning and I want to train a decision tree classifier, a random forest, and gradient-boosted trees. I want to try different maximum depth values and select the best one via grid search and cross-validation. However, Spark tells me that DecisionTree currently only supports maxDepth <= 30. Why is it limited to 30, and is there a way to increase it? My data is text and my feature vectors are TF-IDF values, so I want to try larger maximum depths. Here is the sample code from the Spark website, with some modifications:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Load and parse the data file, converting it to a DataFrame.
# Assumes an active SparkSession named `spark`, as in the PySpark shell.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
                             outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features",
                               outputCol="indexedFeatures",
                               maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
                            featuresCol="indexedFeatures",
                            numTrees=500)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction",
                               outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Grid of maxDepth values to search; every value here exceeds 30.
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [50, 100, 150, 250, 300]) \
    .build()

# Evaluate against the indexed label column that the forest is trained on.
crossval_rf = CrossValidator(estimator=pipeline,
                             estimatorParamMaps=paramGrid_rf,
                             evaluator=BinaryClassificationEvaluator(labelCol="indexedLabel"),
                             numFolds=5)

cvModel_rf = crossval_rf.fit(trainingData)

The code above gives me the error message below.

Py4JJavaError: An error occurred while calling o12383.fit. : java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.

Dr Potato · Accepted Answer · 2018-11-20T20:06:45

From https://forums.databricks.com/questions/12300/for-decision-trees-is-the-current-maxdepth-limited.html

...the current implementation imposes a restriction of maxDepth <= 30:

https://github.com/apache/spark/blob/ca6955858cec868c878a2fd8528dbed0ef9edd3f/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L137
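
The linked line only states the requirement, not the reason for it. One plausible explanation, offered here as an assumption and not something the quoted source confirms, is that tree nodes are numbered heap-style (root = 1, children of node i are 2i and 2i + 1) using a 32-bit signed integer. Under that assumption, the arithmetic works out as follows (the helper names are hypothetical):

```python
# Sketch under an assumption: node indices are 1-based, heap-style,
# and stored in a 32-bit signed integer.
INT32_MAX = 2**31 - 1


def max_node_index(depth):
    """Largest 1-based node index in a complete binary tree of this depth.

    Depth 0 is a single leaf (index 1); depth 1 adds indices 2 and 3.
    """
    return 2 ** (depth + 1) - 1


def fits_in_int32(depth):
    """True if every node index at this depth fits in a signed 32-bit int."""
    return max_node_index(depth) <= INT32_MAX


for depth in (29, 30, 31):
    print(depth, max_node_index(depth), fits_in_int32(depth))
```

With this numbering, depth 30 is the deepest tree whose largest index still fits, which would match the limit in the error message.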

You could ask the Spark developers to raise that limit.
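
Until the limit changes, the grid has to stay at or below 30. A minimal sketch of checking candidate depths before building the grid, assuming the limit of 30 quoted in the error message (the function name and the candidate values are illustrative only):

```python
# Limit taken from the error message: "only supports maxDepth <= 30".
SPARK_MAX_DEPTH = 30


def split_depths(candidates, limit=SPARK_MAX_DEPTH):
    """Split candidate maxDepth values into (accepted, rejected) lists.

    Accepted values are non-negative and at most `limit`; everything
    else would be refused by Spark and is returned separately.
    """
    accepted = sorted(d for d in set(candidates) if 0 <= d <= limit)
    rejected = sorted(d for d in set(candidates) if d < 0 or d > limit)
    return accepted, rejected


accepted, rejected = split_depths([5, 10, 20, 30, 50, 100])
print("usable:", accepted)
print("too deep:", rejected)
```

The accepted list can then be passed to the grid from the question, for example `.addGrid(rf.maxDepth, accepted)`.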
