4 votes

I am puzzled why a Random Forest classification model without cross-validation yields a mean accuracy score of 0.966, but with 5-fold cross-validation the model's mean accuracy score drops to 0.687.

There are 275,956 samples: class 0 = 217,891, class 1 = 6,073, class 2 = 51,992.
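Given that imbalance, it is worth knowing what a trivial majority-class predictor would score; a quick sketch using the counts above:

```python
# Majority-class baseline computed from the class counts quoted above
counts = {0: 217891, 1: 6073, 2: 51992}
baseline = max(counts.values()) / sum(counts.values())
print(f"Always predicting class 0 scores: {baseline:.3f}")  # ~0.790
```

So 0.966 beats that baseline comfortably, while 0.687 is actually below it.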

I am trying to predict the "TARGET" column, which has 3 classes [0, 1, 2]:

data.head()
bottom_temperature  bottom_humidity  top_temperature  top_humidity  external_temperature  external_humidity  weight  TARGET
             26.35            42.94            27.15         40.43                 27.19                0.0     0.0       1
             36.39            82.40            33.39         49.08                 29.06                0.0     0.0       1
             36.32            73.74            33.84         42.41                 21.25                0.0     0.0       1

Following the docs, I split the data into training and test sets:

# docs: http://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a list of the feature column's names
features = data.columns[:7]

# View features
features
Out[]: Index([u'bottom_temperature', u'bottom_humidity', u'top_temperature',
       u'top_humidity', u'external_temperature', u'external_humidity',
       u'weight'],
      dtype='object')


#split data
X_train, X_test, y_train, y_test = train_test_split(data[features], data.TARGET, test_size=0.4, random_state=0)

#build model
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(X_train, y_train)

#predict
preds = clf.predict(X_test)

#accuracy of predictions
accuracy = accuracy_score(y_test, preds)
print('Mean accuracy score:', accuracy)

('Mean accuracy score:', 0.96607267423425713)

#verify - it's the same
clf.score(X_test, y_test)
0.96607267423425713

Onto the cross validation:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, data[features], data.TARGET, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.69 (+/- 0.07)

It is much lower!

And to verify a second way:

#predict with CV
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

predicted = cross_val_predict(clf, data[features], data.TARGET, cv=5)
metrics.accuracy_score(data.TARGET, predicted)

Out[]: 0.68741031178883594

From my understanding, cross-validation should not decrease prediction accuracy by this much; if anything, it should improve the estimate, because the model is evaluated against a "better" representation of all the data.

What is the approximate split between your target classes (0,1 and 2)? And how many samples do you have? - Stev
Also, is your data sorted by target class? - Stev
No the data is sorted by two deleted columns: groupID and timestamp (it is a longitudinal repeated measures dataset with around 80 individual environments monitored) - Evan
StratifiedKFold (used by cross_val_score) doesn't shuffle before splitting by default. My understanding is that this means you are also trying to predict a temporal element of your data when doing cross-validation, unlike train_test_split, which draws randomly from the dataset. As a test, try setting cv=ms.StratifiedKFold(n_splits=5, shuffle=True). - Stev
So all fixed? If so, great, I'll put it as an answer because it's important that people know about this behaviour. - Stev

3 Answers

5 votes

Normally I would agree with Vivek and tell you to trust your cross-validation.

However, some level of CV is inherent in a random forest, because each tree is grown from a bootstrap sample, so you shouldn't expect to see such a large drop in accuracy when running cross-validation. I suspect your problem is due to some sort of time- or location-dependence in your data's sort order.

When you use train_test_split, data is drawn randomly from the dataset, so all 80 of your environments are likely to be present in your train and test datasets. However, when you split using the default options for CV, I believe that each of the folds is drawn in order, so each of your environments is not present within every fold (assuming your data is ordered by environment). This leads to a lower accuracy because you are predicting one environment using data from another.

The simple solution is to set cv=StratifiedKFold(n_splits=5, shuffle=True) (imported from sklearn.model_selection).
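To illustrate the effect, here is a self-contained sketch with synthetic data standing in for the sorted environments (all names and numbers here are made up for illustration, not taken from the question): each "environment" shifts the feature by a different offset, and the label depends only on the deviation from that environment's own baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for data sorted by environment: 5 environments of
# 200 samples each, with per-environment offsets of 0, 10, ..., 40.
rng = np.random.RandomState(0)
noise = rng.normal(size=1000)
offsets = np.repeat([0, 10, 20, 30, 40], 200)  # data is sorted by environment
X = (offsets + noise).reshape(-1, 1)
y = (noise > 0).astype(int)  # label depends on deviation from the offset

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Default: folds follow the sort order, so each test fold is dominated by
# an environment the model never saw during training.
ordered = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

# Shuffled: every environment appears in every training fold.
shuffled = cross_val_score(
    clf, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("ordered folds:  %.2f" % ordered.mean())
print("shuffled folds: %.2f" % shuffled.mean())
```

On this toy data the ordered folds score far worse than the shuffled ones, for exactly the reason described above: the model is asked to predict one environment from the others.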

I have run into this problem several times before when using concatenated datasets, and there must be hundreds of others who have hit it without realising what the issue was. From what I have seen in GitHub discussions, the idea behind the default behaviour is to maintain order in a time series.

4 votes

In train_test_split, you are training on 60% of the data (test_size=0.4) a single time. But in cross_val_score with cv=5, the data is split five times: each time four folds (80%) become the training set and the remaining fold becomes the test set.

Now you might think that 80% training data is more than 60%, so the accuracy should not decrease. But there is one more thing to notice here.

train_test_split does not stratify the splits by default, but cross_val_score does (for classifiers it uses StratifiedKFold). Stratification keeps the ratio of classes (targets) the same in each fold. So most probably the ratio of targets is not maintained in your train_test_split, which lets the classifier over-fit and produces this high score.
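For what it's worth, train_test_split does accept a stratify argument that makes the holdout split stratified too. A minimal sketch with made-up imbalanced labels (not the poster's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy 90:10 imbalanced labels, purely for illustration
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the class ratio in both train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

print("train minority ratio:", y_tr.mean())  # ~0.10
print("test minority ratio: ", y_te.mean())  # ~0.10
```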

I would suggest taking the cross_val_score result as the final score.

2 votes

Your data likely has some inherent order to it. Set shuffle=True in your CV splitter.