0
votes

I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:

  1. With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check P statistics etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model, (for example, .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant - if so how can I tell?

  2. I have been very unsuccessful in visualising the decision tree. On the various different ways I have tried, the code seems to run without any errors, yet nothing appears / plots. The only thing I can do successfully is tree.export_text(model), which just states feature_1, feature_2, and so on. I don't know what any of the features actually are. Has anybody come across these difficulties with visualising / have a simple solution?

  3. The confusion matrix that I have generated is as follows:

    [[   0  395]
     [   0 3319]]
    

i.e. the model is predicting all rows to the same outcome. Does anyone know why this might be?

1

1 Answers

1
votes
  1. Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always build your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid off the least important feature in each loop according to feature_importances (where the importance of each feature is defined as its contribution to the reduction in entropy, gini, etc.).

  2. The most straight forward way to visualize a tree is tree.plot_tree(). In particular, you should try passing the names of the features to feature_names. Please show us what you have tried so far if you want a more specific answer.

  3. Try another criterion, set a higher max_depth, etc. Sometimes datasets have unidentifiable records. For example, two observations with the exact same values in all features, but different target labels. Is this the case in your dataset?