I know only a little about how random forests work. Usually, for classification, I fit the training data to a random forest classifier and then ask it to predict the test data.
Currently I am working on the Titanic data that was provided to me. These are the top rows of the data set; there are approximately 1300 rows in total.
survived pclass sex age sibsp parch fare embarked
0 1 1 female 29 0 0 211.3375 S
1 1 1 male 0.9167 1 2 151.55 S
2 0 1 female 2 1 2 151.55 S
3 0 1 male 30 1 2 151.55 S
4 0 1 female 25 1 2 151.55 S
5 1 1 male 48 0 0 26.55 S
6 1 1 female 63 1 0 77.9583 S
7 0 1 male 39 0 0 0 S
8 1 1 female 53 2 0 51.4792 S
9 0 1 male 71 0 0 49.5042 C
10 0 1 male 47 1 0 227.525 C
11 1 1 female 18 1 0 227.525 C
12 1 1 female 24 0 0 69.3 C
13 1 1 female 26 0 0 78.85 S
No test data is given, so I want the random forest to predict survival on the entire data set and compare the predictions with the actual values (essentially checking the accuracy score).
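For comparison, one alternative I am aware of when no test set is provided is to hold out part of the rows myself. The following is only a minimal sketch under assumptions: it uses synthetic data from `make_classification` as a stand-in for the Titanic table (the real columns would need encoding first), and the 25% split size and `random_state` values are arbitrary choices, not something from my actual code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the Titanic table (1300 rows, 7 numeric features).
X, y = make_classification(
    n_samples=1300, n_features=7, n_informative=4, random_state=0
)

# Hold out 25% of the rows; the forest never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)  # accuracy on rows used for fitting
test_accuracy = clf.score(X_test, y_test)     # accuracy on the held-out rows

print("train accuracy:", train_accuracy)
print("held-out accuracy:", test_accuracy)
```

With a split like this, the two numbers can be compared directly to see how much of the score comes from rows the forest has already seen.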
What I have done is divide my complete dataset into two parts: one with the features and one with the value to predict (survived). The features consist of all the columns except survived, and the target consists of the survived column alone.
dfFeatures = dfCopy.drop('survived', axis=1)  # every column except 'survived'
dfTarget = df['survived']  # the column to predict
Note: df is the entire dataset.
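Here is the code that checks the score of the random forest:
from sklearn.ensemble import RandomForestClassifier
rfClf = RandomForestClassifier(n_estimators=100, max_features=10)
rfClf = rfClf.fit(dfFeatures, dfTarget)
# score the fitted forest on the same data it was fit on
scoreForRf = rfClf.score(dfFeatures, dfTarget)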
I get score output like this:
The accuracy score for random forest is : 0.983193277311
I am finding it a little difficult to understand what is happening behind the code given above.
Does it predict survival for all the tuples based on the other features (dfFeatures), compare the predictions with the actual values (dfTarget), and give the prediction score? Or does it randomly create train and test sets from the data provided and compute the accuracy on the test set it generated behind the scenes?
To be more precise: when calculating the accuracy score, does it predict survival for the entire data set or just for a random part of it?
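One check I could run to tell the two cases apart is to call `predict` on the same data myself and compare the result with `score`. This is a hedged sketch on synthetic data (again a `make_classification` stand-in, not my Titanic frame); `oob_score=True` is added only to print scikit-learn's out-of-bag estimate next to it, and the variable names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assumption: synthetic stand-in for the Titanic table (1300 rows, 7 numeric features).
X, y = make_classification(
    n_samples=1300, n_features=7, n_informative=4, random_state=0
)

# oob_score=True additionally stores an out-of-bag accuracy estimate after fit.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

predictions = clf.predict(X)                      # one prediction per row passed in
manual_accuracy = accuracy_score(y, predictions)  # fraction of rows predicted correctly
score_accuracy = clf.score(X, y)                  # what the code in the post computes

print("rows predicted:", len(predictions), "of", len(y))
print("score matches manual accuracy:", manual_accuracy == score_accuracy)
print("out-of-bag estimate:", clf.oob_score_)
```

If the two accuracies match and the number of predictions equals the number of rows, then `score` is evaluating every row it is given rather than a hidden random subset.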
feature selection. I highly recommend reading all parts of this post about using random forests on the same dataset. – Omid