I am using PySpark for machine learning, and I want to train a decision tree classifier, a random forest, and gradient-boosted trees. I want to try different maximum depth values and select the best one via grid search and cross-validation. However, Spark tells me that DecisionTree currently only supports maxDepth <= 30. What is the reason for limiting it to 30? Is there a way to increase it? I am working with text data and my feature vectors are TF-IDF vectors, so I want to try higher values for the maximum depth. Here is sample code from the Spark website, with some modifications:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
                             outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
                            featuresCol="indexedFeatures", numTrees=500)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction",
                               outputCol="predictedLabel",
                               labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [50, 100, 150, 250, 300]) \
    .build()
crossval_rf = CrossValidator(estimator=pipeline,
                             estimatorParamMaps=paramGrid_rf,
                             evaluator=BinaryClassificationEvaluator(),
                             numFolds=5)
cvModel_rf = crossval_rf.fit(trainingData)
The code above gives me the error message below.
Py4JJavaError: An error occurred while calling o12383.fit. : java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.
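One possible explanation for the limit, stated here as an assumption rather than something the error message confirms: if Spark numbers tree nodes heap-style (root = 1, children of node i at 2i and 2i + 1) and stores those indices in a signed 32-bit integer, then depth 30 is the deepest tree whose indices still fit. The sketch below only checks that arithmetic in plain Python and filters a candidate depth list down to values Spark would accept. The helper names `max_node_index` and `clamp_depths` are hypothetical and not part of PySpark.

```python
# Assumption: tree nodes are indexed heap-style in a signed 32-bit integer,
# so a tree of depth d can need node indices as large as 2**(d + 1) - 1.
INT_MAX = 2**31 - 1        # largest signed 32-bit integer
SPARK_MAX_DEPTH = 30       # limit quoted in the error message


def max_node_index(depth):
    """Largest node index in a full binary tree (root index 1 at depth 0)."""
    return 2 ** (depth + 1) - 1


def clamp_depths(candidates, limit=SPARK_MAX_DEPTH):
    """Keep only the candidate depths that are within the accepted range."""
    return sorted({d for d in candidates if 0 <= d <= limit})


# Depth 30 still fits in 32 bits; depth 31 does not.
fits_30 = max_node_index(30) <= INT_MAX
fits_31 = max_node_index(31) <= INT_MAX

# Hypothetical candidate list mixing accepted and rejected depths.
depth_grid = clamp_depths([5, 10, 20, 30, 50, 100, 150, 250, 300])
```

The resulting `depth_grid` could be passed to `.addGrid(rf.maxDepth, depth_grid)` in place of the list used above, which at least lets the cross-validation run while the question of going beyond 30 remains open.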