
I use scikit-learn's RandomForestClassifier() in Python. Because I want to visualize a random forest to understand the relationships between different features, I use export_graphviz() to achieve this goal.

estimator1 = best_model1.estimators_[0]

from sklearn.tree import export_graphviz
export_graphviz(estimator1,
                out_file='tree_from_optimized_forest.dot',
                rounded=True,
                feature_names=X_train.columns,
                class_names=["No", "Yes"],
                filled=True)

from subprocess import call
call(['dot', '-Tpng', 'tree_from_optimized_forest.dot', '-o', 'tree_from_optimized_forest.png', '-Gdpi=200'])

from IPython.display import Image
Image('tree_from_optimized_forest.png')

However, unlike a decision tree, a random forest produces many trees, the number of which depends on n_estimators in RandomForestClassifier().

best_model1 = RandomForestClassifier(n_estimators=100,
                                     criterion='gini',
                                     random_state=42)
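If you do want to inspect more than a single tree, you can loop over the fitted forest's estimators_ and export every tree. A minimal sketch, using the Iris dataset as a stand-in for your own X_train and y_train:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=42).fit(X, y)

# Export each fitted tree as DOT source; out_file=None returns the
# DOT text as a string instead of writing a .dot file to disk.
dot_sources = [
    export_graphviz(tree, out_file=None, rounded=True, filled=True)
    for tree in forest.estimators_
]
print(len(dot_sources))  # one DOT graph per tree in the forest
```

Each string in dot_sources can then be rendered with Graphviz the same way you render the single .dot file above.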

Besides, because DecisionTreeClassifier() uses all the samples to build just one tree, we can interpret the results directly from that single tree.

In contrast, a random forest trains several different trees and then lets them vote to decide the result. In addition, the contents of these trees differ because random forests use techniques such as bootstrapping, bagging, and out-of-bag estimation.
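For what it's worth, scikit-learn's RandomForestClassifier combines its trees by averaging their per-tree class probabilities (soft voting) rather than by a hard majority vote, and the bootstrapped trees can indeed disagree with each other. A small sketch to check this, again assuming the Iris data as a stand-in for your own:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Each bootstrapped tree produces its own class-probability estimates ...
per_tree = np.stack([t.predict_proba(X) for t in forest.estimators_])

# ... and the forest's prediction is simply their average.
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X)))  # True
```

This is one more reason a single tree is not representative: the forest's answer is an aggregate that no individual tree computes on its own.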

Therefore, I want to ask: if I visualize only one of the trees produced by RandomForestClassifier(), does that tree have any reference value?

Can I directly interpret the content of this tree as the analysis result for the whole dataset? If not, is DecisionTreeClassifier() the only way to analyze the relationships between features through a visualized image?

Thanks a lot!!


1 Answer


There has always been a trade-off in machine learning between a model's interpretability and its complexity, and your post relates directly to this.

Decision trees are quite simple models that are used intensively precisely for their interpretability, but a single tree tends to overfit its training data (it has high variance), so people came up with random forest classifiers. A random forest reduces the variance of the vanilla decision tree by averaging many trees trained on bootstrapped samples, but unfortunately, in that process, it gives up the straightforward interpretability.

Yet, there are still some tools that can help you gain insight into the learned function and the contribution of each feature; one of those tools is treeinterpreter, which you can learn more about in this article.
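Besides treeinterpreter, scikit-learn's forests also expose a built-in feature_importances_ attribute, which aggregates the impurity-based importance of each feature across all trees. A minimal sketch on the Iris data (substitute your own fitted model and column names):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(data.data, data.target)

# Impurity-based importances, averaged over all trees; they sum to 1.
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```

This gives a forest-wide summary of which features matter, which is usually more trustworthy than reading structure off any single exported tree.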