0
votes

I have the following dataset for predicting whether a team wins a game, where each row corresponds to a training example and each column to a particular feature. I want the decision tree to consider a split on each feature in each of the columns when determining the final regression values:

 Train= 
 [['0' '0' '1' '-1' '8' '-9']
 ['-15' '0' '0' '18' '7' '11']
 ['-8' '0' '0' '8' '2' '6']
 ...
 ['11' '0' '2' '-15' '-3' '-12']
 ['3' '0' '-1' '-16' '-15' '-1']
 ['-3' '0' '0' '-6' '4' '-10']]

Result=
[1,1,0,...,1]

Based on the output regression values (which are nothing but the probabilities of winning), I apply a threshold function to classify the output as a '1' (wins) or a '0' (loses). This cannot be turned into a plain classification problem, because the probability is an important intermediate step.

I was wondering if using the scikit-learn decision tree regressor directly helps:

from sklearn.tree import DecisionTreeRegressor

regr_2 = DecisionTreeRegressor(max_depth=6)
regr_2.fit(Train, Result)
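For context, the thresholding step I apply afterwards looks roughly like this (the 0.5 cutoff is just the value I picked, and Test is a hypothetical held-out array with the same columns as Train):

probs = regr_2.predict(Test)            # probability-like outputs in [0, 1]
predicted = (probs >= 0.5).astype(int)  # 1 = wins, 0 = loses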

I also saw this tutorial on decision trees and was wondering whether I should construct a decision tree from the ground up in this case. How does the scikit-learn function create the splits? Does it do what I intend? Kindly point out any possible flaws in my approach.

Also, what is the difference between max_features and max_depth?

1
This is a classification problem, not regression. Therefore, you can use DecisionTreeClassifier instead - O.Suleiman
My main aim is to predict the probability of winning, from which I then decide whether they win or not. Sorry for not being clear. Hence I cannot use DecisionTreeClassifier - thegreatcoder
Well, either way, this is still a classification problem; you can use the DecisionTreeClassifier and then call predict_proba() instead of predict() in order to get probabilities instead of classes. - O.Suleiman
@Shubashree you seem to ignore that most classifiers, including decision trees, actually produce a probability output, which is subsequently converted to a "hard" class membership (0/1) according to a threshold... - desertnaut
@Shubashree did I answer your question? - David Masip

1 Answer

2
votes

By default, scikit-learn uses the Gini impurity measure (see Gini impurity, Wikipedia) to decide where to split the branches of a decision tree. This usually works quite well, and unless you have good knowledge of your data and of how the splits should be done, it is preferable to stick with the scikit-learn default.
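As a rough sketch of what that measure computes (this helper is my own illustration, not scikit-learn's internal code), the impurity of a node is 1 minus the sum of squared class proportions:

import numpy as np

def gini(labels):
    # p_k is the fraction of samples in the node belonging to class k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([1, 1, 0, 1]))  # 0.375 -> mixed node, impure
print(gini([1, 1, 1, 1]))  # 0.0   -> pure node

A split is chosen so that it minimises the weighted impurity of the two child nodes it creates.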

About max_depth: this is the maximum depth of your tree. You don't want it to be very large, because you will probably overfit the training data.
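For instance (a sketch with synthetic data standing in for your Train/Result arrays; the particular depths are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 6)                     # stand-in for Train
y = (X[:, 0] + rng.randn(200) > 0) * 1.0  # stand-in for Result

deep = DecisionTreeRegressor().fit(X, y)             # no depth cap
shallow = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(deep.score(X, y))     # ~1.0: the tree has memorised the training data
print(shallow.score(X, y))  # lower on the training set, but less likely to overfit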

About max_features: every time there is a split, the training algorithm looks at a number of features, takes the one with the best metric (here, Gini impurity), and creates two branches according to that feature. Looking at all the features at every split can be computationally heavy, so you can check only a subset of them. max_features is the number of features considered each time a pair of branches is created at a node.
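In scikit-learn, max_features accepts an int, a float, or a string (the values below are only illustrative):

from sklearn.tree import DecisionTreeRegressor

DecisionTreeRegressor(max_features=None)    # default: consider all features at each split
DecisionTreeRegressor(max_features=3)       # consider 3 randomly chosen features per split
DecisionTreeRegressor(max_features="sqrt")  # consider sqrt(n_features) features per split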