
I know a little about how random forests work. Usually in classification I would fit the training data with the random forest classifier and ask it to predict the test data.

Currently I am working on the Titanic data that was provided to me. These are the top rows of the data set, and there are roughly 1300 rows in total.

   survived pclass sex    age     sibsp parch fare     embarked
0  1        1      female 29      0     0     211.3375 S
1  1        1      male   0.9167  1     2     151.55   S
2  0        1      female 2       1     2     151.55   S
3  0        1      male   30      1     2     151.55   S
4  0        1      female 25      1     2     151.55   S
5  1        1      male   48      0     0     26.55    S
6  1        1      female 63      1     0     77.9583  S
7  0        1      male   39      0     0     0        S
8  1        1      female 53      2     0     51.4792  S
9  0        1      male   71      0     0     49.5042  C
10 0        1      male   47      1     0     227.525  C
11 1        1      female 18      1     0     227.525  C
12 1        1      female 24      0     0     69.3     C
13 1        1      female 26      0     0     78.85    S

There is no test data given. So I want the random forest to predict survival on the entire data set and compare it with the actual values (more like checking the accuracy score).

So what I have done is divide my complete data set into two parts: one with the features, and the other with the target (survived). The features consist of all the columns except survived, and the target consists of the survived column.

dfTarget = df['survived']
dfFeatures = dfCopy.drop('survived', axis=1)

Note: df is the entire dataset.

Here is the code that checks the score of the random forest:

rfClf = RandomForestClassifier(n_estimators=100, max_features=10)
rfClf = rfClf.fit(dfFeatures, dfTarget)
scoreForRf = rfClf.score(dfFeatures, dfTarget)

I get score output something like this:

The accuracy score for random forest is :  0.983193277311

I am finding it a little difficult to understand what is happening behind the code given above.

Does it predict survival for all the rows based on the features (dfFeatures), compare that with the actual labels (dfTarget), and report the prediction score? Or does it internally create random train and test splits from the data provided and report the accuracy on the test split it generated behind the scenes?

To be more precise: while calculating the accuracy score, does it predict survival for the entire data set or just a random partial data set?

If you manually divide the data set into train and test, then yes: it predicts the survived column for the training set, matches it against the test set, and that is your accuracy score. – Arman
@Arman what if I don't divide it into a training set and a test set? Does it not randomly generate a test set (67-33) behind the scenes? – Cybercop
I think so, with respect to a parameter that describes how much is the test set and how much the training set; however I'm not exactly sure. Maybe in that situation the accuracy is the training accuracy score, not the test accuracy score. – Arman
@Cybercop The test/train split is actually done by subsetting the rows, not the columns! The latter is called feature selection. I highly recommend reading all parts of this post about using random forests on the same dataset. – Omid
@Omid yes, the train/test split is done on rows and not columns. And I also know feature selection is for splitting a node of a tree. But how does that help my question? – Cybercop

1 Answer


Somehow I don't see you trying to split the data set into train and test.

dfWithTestFeature = df['survived']

dfWithTestFeature contains only the survived column, which holds the labels.

dfWithTrainFeatures = dfCopy.drop('survived', axis=1)

dfWithTrainFeatures contains all the features (pclass, sex, age, etc.).

Now, jumping into the code:

rfClf = RandomForestClassifier(n_estimators=100, max_features=10)

The line above creates the random forest classifier. n_estimators is the number of trees in the forest, not the depth of a tree; tree depth is controlled separately by max_depth, and leaving trees unconstrained in depth is what tends to overfit the data. More trees mostly just cost more training time.

rfClf = rfClf.fit(dfWithTrainFeatures, dfWithTestFeature) 

The line above is the training step. .fit() needs 2 parameters: the first is the features, and the second is the labels (the target value, i.e. the survived column) that go with those features.

scoreForRf = rfClf.score(dfWithTrainFeatures, dfWithTestFeature)

.score() needs 2 parameters: the 1st is the features and the 2nd is the labels. It uses the model fitted by .fit() to predict on the 1st parameter, compares those predictions against the 2nd parameter, and returns the mean accuracy.
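The equivalence can be checked directly: .score(X, y) returns the same number as computing accuracy over .predict(X) by hand. A minimal sketch, using a small synthetic data set in place of the Titanic frame:

```python
# Sketch: .score(X, y) is shorthand for predicting X and measuring
# accuracy against y. Synthetic data stands in for the Titanic set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# The two values are identical: score predicts internally, then
# compares the predictions with the labels you passed in.
assert clf.score(X, y) == accuracy_score(y, clf.predict(X))
```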

From what I see, you're using the same data to train and test the model, which is not good: the number you get is a training accuracy, and it will be optimistically high.

To be more precise: while calculating the accuracy score, does it predict survival for the entire data set or just a random partial data set?

You used all the data to test the model, so it predicts on the entire data set; nothing is split off behind the scenes. There is no hidden train/test split in .fit() or .score().
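If you want the score to estimate performance on unseen passengers, hold part of the data out yourself before fitting. A hedged sketch with train_test_split (again on synthetic stand-in data; with the Titanic frame you would pass dfWithTrainFeatures and dfWithTestFeature):

```python
# Sketch: hold out a test set so the score measures generalization
# rather than memorization. Synthetic stand-in for the Titanic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 67/33 split, matching the proportions discussed in the comments
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # typically near 1.0 for a forest
test_acc = clf.score(X_test, y_test)     # the honest estimate
```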

I could use cross validation, but then again the question is: do I have to for a random forest? Also, cross validation for random forests seems to be very slow.

Of course you need some form of validation to test your model. Also create a confusion matrix and compute precision and recall; don't depend on accuracy alone.
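A sketch of both suggestions: 5-fold cross-validation with cross_val_score, plus a confusion matrix with precision and recall on a held-out split (synthetic stand-in data again; the variable names are illustrative):

```python
# Sketch: cross-validated accuracy, then confusion matrix /
# precision / recall on a held-out split. Synthetic binary data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Five out-of-fold accuracy scores; each fold is tested on rows
# the model was not fitted on.
cv_scores = cross_val_score(clf, X, y, cv=5)

# Confusion matrix, precision, and recall on a held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

cm = confusion_matrix(y_te, pred)        # 2x2: [[TN, FP], [FN, TP]]
prec = precision_score(y_te, pred)       # TP / (TP + FP)
rec = recall_score(y_te, pred)           # TP / (TP + FN)
```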

If you think the model is running too slowly, decrease the n_estimators value (fewer trees means less work per fit), or set n_jobs=-1 so the trees are fitted in parallel.