I have the following dataset for predicting whether a team wins a game, in which each row corresponds to a training example and each column corresponds to a feature. I want the decision tree to consider a split on each feature (i.e., each column) when determining the final regression values:
Train=
[['0' '0' '1' '-1' '8' '-9']
['-15' '0' '0' '18' '7' '11']
['-8' '0' '0' '8' '2' '6']
...
['11' '0' '2' '-15' '-3' '-12']
['3' '0' '-1' '-16' '-15' '-1']
['-3' '0' '0' '-6' '4' '-10']]
Result=
[1,1,0,...,1]
Based on the output regression values (which are nothing but the probability with which the team wins), I apply a threshold function to classify the output as a '1' (wins) or a '0' (loses). This cannot simply be turned into a classification problem, because the probability is an important intermediate step.
I was wondering if using the scikit-learn decision tree regressor directly helps:
from sklearn.tree import DecisionTreeRegressor

# Fit a regression tree of depth at most 6
regr_2 = DecisionTreeRegressor(max_depth=6)
regr_2.fit(Train, Result)
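To get the final labels, I then apply the threshold to the regression outputs, roughly like this (the 0.5 cutoff is just an assumed value for illustration):

import numpy as np

# Regression outputs, which I interpret as win probabilities
probs = regr_2.predict(Train)
# Threshold function: anything at or above the cutoff counts as a win
wins = (probs >= 0.5).astype(int)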
I also saw this tutorial on decision trees and was wondering whether I should construct a decision tree from the ground up in this case. How does the scikit-learn function create the splits? Does it do what I intend? Kindly let me know the possible flaws in my approach.
Also, what is the difference between max_features and max_depth?
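For context, this is how I would pass both parameters; my current understanding (please correct me if I am wrong) is that max_depth caps how deep the tree can grow, while max_features limits how many features are evaluated as split candidates at each node:

from sklearn.tree import DecisionTreeRegressor

# max_depth: stop splitting once the tree reaches this depth
# max_features: consider only this many features when searching
# for the best split at each node
regr = DecisionTreeRegressor(max_depth=6, max_features=3)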
Comments:
Use DecisionTreeClassifier instead. – O.Suleiman
Use DecisionTreeClassifier and then use predict_proba() instead of predict() in order to get probabilities instead of classes. – O.Suleiman
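A sketch of what the comment suggests, assuming the same Train and Result arrays as above (predict_proba() returns one column per class, so column [:, 1] holds the probability of class '1', i.e. a win):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=6)
clf.fit(Train, Result)

# Column for class '1' gives the estimated win probability
win_probs = clf.predict_proba(Train)[:, 1]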