0
votes

I would like to build a decision tree classifier using Python, but I want to force the tree, regardless of what it thinks is best, to only split one node each time into two leaves. That is, each time, a node splits into a terminal leaf and another interior node that will continue to split, rather than into two interior nodes that can themselves split. I want one of the splits to end in a termination each time, until you end up having two leaves with below the minimum number.

For instance, the following tree satisfies this requirement enter image description here

but the second one does not: enter image description here

The reason I want to do this is to obtain a nested set of splits of observations. I saw on another post (Finding a corresponding leaf node for each data point in a decision tree (scikit-learn)) that the node IDs of observations can be found, which is crucial. I realize I can do this by building a tree without such a restriction, and going up one of the leaf nodes to the top, but this may not give enough observations, and I would essentially like to get this nested structure over all the observations in the dataset.

In my application I actually don't care about the classification task, I just want to obtain this nested set of observations formed by splits on features. I had planned on making the target variable randomly generated so the split on features would not be meaningful (which is counter-intuitive that this is what I want, but I'm using it for a different purpose). Alternatively, if someone knows of a similar binary split method in Python which achieves the same purpose, please let me know.

1

1 Answers

0
votes

I realized one way to do this is to construct a decision tree with max_depth=1. This will perform a split into two leaves. Then pick the leaf with the highest impurity to continue splitting, fitting the decision tree again to only this subset, and repeat. To make sure that the hierarchy is obvious, I relabel the leaf_ids so it is clear that as you go up the tree, the ID values go down. Here is an example:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

def decision_tree_one_path(X, y=None, min_leaf_size=3):
    nobs = X.shape[0]

    # boolean vector to include observations in the newest split
    include = np.ones((nobs,), dtype=bool)
    # try to get leaves around min_leaf_size
    min_leaf_size = max(min_leaf_size, 1)
    # one-level DT splitter
    dtmodel = DecisionTreeClassifier(splitter="best", criterion="gini", max_depth=1, min_samples_split=int(np.round(2.05*min_leaf_size)))             
    leaf_id = np.ones((nobs,), dtype='int64')
    iter = 0

    if y is None:
        y = np.random.binomial(n=1, p=0.5, size=nobs)

    while nobs >= 2*min_leaf_size:

        dtmodel.fit(X=X.loc[include], y=y[include])
        # give unique node id
        new_leaf_names = dtmodel.apply(X=X.loc[include])
        impurities = dtmodel.tree_.impurity[1:]
        if len(impurities) == 0:
            # was not able to split while maintaining constraint
            break

        # make sure node that is not split gets the lower node_label 1
        most_impure_node = np.argmax(impurities)
        if most_impure_node == 0: # i.e., label 1
            # switch 1 and 2 labels above
            is_label_2 = new_leaf_names == 2
            new_leaf_names[is_label_2] = 1
            new_leaf_names[np.logical_not(is_label_2)] = 2

        # rename leaves
        leaf_id[include] = iter + new_leaf_names
        will_be_split = new_leaf_names == 2

        # ignore the other one
        tmp = np.ones((nobs,), dtype=bool)
        tmp[np.logical_not(will_be_split)] = False
        include[include] = tmp

        # now create new labels
        nobs = np.sum(will_be_split)
        iter = iter + 1

    return leaf_id

leaf_id is thus the leaf IDs of observations in order. Thus, for instance leaf_id==1 is the first observations that were split off into a terminal node. leaf_id==2 is the next one that was split off into a terminal node from the split in which leaf_id==1 was generated, as shown below. There are thus k+1 leaves.

#0
#|\
#1 .
#  |\
#  2 .
#.......
#
#     |\ 
#     k (k+1) 
   

I was wondering if there is a way to do this automatically in Python, though.