I am evaluating my Decision Tree Classifier and trying to plot feature importances. The plot renders correctly, but it shows all 80+ features, which makes for a very messy visual. I want to limit the plot to only the features that are important, ordered by importance.
The dataset (file.xlsx) can be downloaded to your working directory from: https://github.com/Arsik36/Python
Minimal reproducible code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
file = 'file.xlsx'
my_df = pd.read_excel(file)
# Determining response variable
my_df_target = my_df.loc[ :, 'Outcome']
# Determining explanatory variables
my_df_data = my_df.drop('Outcome', axis = 1)
# Declaring train_test_split with stratification
X_train, X_test, y_train, y_test = train_test_split(my_df_data,
my_df_target,
test_size = 0.25,
random_state = 331,
stratify = my_df_target)
# Declaring class weight
weight = {0: 455, 1: 1831}
# Instantiating Decision Tree Classifier
decision_tree = DecisionTreeClassifier(max_depth = 5,
min_samples_leaf = 25,
class_weight = weight,
random_state = 331)
# Fitting the training data
decision_tree_fit = decision_tree.fit(X_train, y_train)
# Predicting on the test data
decision_tree_pred = decision_tree_fit.predict(X_test)
# Declaring the number of features in the X_train data
n_features = X_train.shape[1]
# Setting the plot window
fig, ax = plt.subplots(figsize = (12, 9))
# Specifying the contents of the plot
plt.barh(range(n_features), decision_tree_fit.feature_importances_, align = 'center')
plt.yticks(range(n_features), X_train.columns)
plt.xlabel("The degree of importance")
plt.ylabel("Feature")
I am trying to limit the output to only the important features. A comment (by Cristian Contrera) suggested filtering like this:

decision_tree_fit.feature_importances_[decision_tree_fit.feature_importances_ > 0.05]

but I am not sure how to apply that inside plt.barh while keeping the feature names on the y-axis aligned with the remaining bars.
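To make the goal concrete, here is a sketch of the behaviour I am after (the helper name, the 0.05 default threshold, and the synthetic call below are my own assumptions for illustration, not from my actual data):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_top_features(importances, feature_names, threshold=0.05):
    """Plot only the features whose importance exceeds `threshold`,
    sorted by importance, keeping names aligned with bars."""
    importances = np.asarray(importances, dtype=float)
    feature_names = np.asarray(feature_names)

    # Drop the unimportant features
    mask = importances > threshold
    kept_imp = importances[mask]
    kept_names = feature_names[mask]

    # Sort ascending so the largest bar ends up at the top of barh
    order = np.argsort(kept_imp)
    kept_imp = kept_imp[order]
    kept_names = kept_names[order]

    fig, ax = plt.subplots(figsize=(12, 9))
    ax.barh(range(len(kept_imp)), kept_imp, align='center')
    ax.set_yticks(range(len(kept_imp)))
    ax.set_yticklabels(kept_names)
    ax.set_xlabel("The degree of importance")
    ax.set_ylabel("Feature")
    return kept_names, kept_imp

# Hypothetical usage with my fitted tree:
# plot_top_features(decision_tree_fit.feature_importances_, X_train.columns)
```

The mask and the names are filtered together, so the y-tick labels stay in sync with the bars.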