
I am evaluating my Decision Tree Classifier and trying to plot feature importances. The graph renders correctly, but it shows all 80+ features, which makes for a very messy visual. I want to limit the plot to only the important features, ordered by importance.

The dataset (file.xlsx), to be downloaded to your working directory, is available here: https://github.com/Arsik36/Python

Minimum reproducible code:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    file = 'file.xlsx'
    my_df = pd.read_excel(file)

    # Determining response variable
    my_df_target = my_df.loc[:, 'Outcome']

    # Determining explanatory variables
    my_df_data = my_df.drop('Outcome', axis = 1)

    # Declaring train_test_split with stratification
    X_train, X_test, y_train, y_test = train_test_split(my_df_data,
                                                        my_df_target,
                                                        test_size = 0.25,
                                                        random_state = 331,
                                                        stratify = my_df_target)

    # Declaring class weight
    weight = {0: 455, 1: 1831}

    # Instantiating Decision Tree Classifier
    decision_tree = DecisionTreeClassifier(max_depth = 5,
                                           min_samples_leaf = 25,
                                           class_weight = weight,
                                           random_state = 331)

    # Fitting the training data
    decision_tree_fit = decision_tree.fit(X_train, y_train)

    # Predicting on the test data
    decision_tree_pred = decision_tree_fit.predict(X_test)

    # Declaring the number of features in the X_train data
    n_features = X_train.shape[1]

    # Setting the plot window
    fig, ax = plt.subplots(figsize = (12, 9))

    # Specifying the contents of the plot
    plt.barh(range(n_features), decision_tree_fit.feature_importances_, align = 'center')
    plt.yticks(np.arange(n_features), X_train.columns)
    plt.xlabel("The degree of importance")
    plt.ylabel("Feature")

Current output I am trying to limit to only the important features: [screenshot of the horizontal bar chart showing all 80+ features on the y-axis]

Use decision_tree_fit.feature_importances_[decision_tree_fit.feature_importances_ > 0.005] in plt.barh – Cristian Contrera

@CristianContrera I tried this, but it doesn't quite work. The features outside the condition are still present; only their bar values disappear when they fail the condition. Do you know how I can remove features that don't satisfy the condition from the graph entirely? When I try your solution, I get the following error: ValueError: shape mismatch: objects cannot be broadcast to a single shape – Arsik36

1 Answer


You need to modify your plot code to drop the low-importance features. Try this (untested):

    # Setting the plot window
    fig, ax = plt.subplots(figsize = (12, 9))

    # Mask selecting only the features above the importance threshold
    features_mask = decision_tree_fit.feature_importances_ > 0.005

    # Specifying the contents of the plot: mask both the bar values
    # and the tick labels so their shapes match
    plt.barh(range(sum(features_mask)), decision_tree_fit.feature_importances_[features_mask], align = 'center')
    plt.yticks(np.arange(sum(features_mask)), X_train.columns[features_mask])
    plt.xlabel("The degree of importance")
    plt.ylabel("Feature")
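
Since you also want the bars ordered by importance, here is a minimal sketch along the same lines (untested against your data; the 0.005 threshold and the helper names mask, order, sorted_importances, and sorted_names are just illustrative). It uses np.argsort to reorder the masked importances before plotting:

    import numpy as np
    import matplotlib.pyplot as plt

    # Keep only the features above the (assumed) importance threshold
    importances = decision_tree_fit.feature_importances_
    mask = importances > 0.005

    # Sort the surviving features by importance; ascending order puts
    # the most important bar at the top of a horizontal bar chart
    order = np.argsort(importances[mask])
    sorted_importances = importances[mask][order]
    sorted_names = X_train.columns[mask][order]

    fig, ax = plt.subplots(figsize = (12, 9))
    ax.barh(range(len(sorted_importances)), sorted_importances, align = 'center')
    ax.set_yticks(np.arange(len(sorted_importances)))
    ax.set_yticklabels(sorted_names)
    ax.set_xlabel("The degree of importance")
    ax.set_ylabel("Feature")
    plt.show()

Masking both the importances and the column labels with the same boolean array is what avoids the shape mismatch error you saw in the comments.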