2
votes

Using scikit-learn, I am doing supervised learning in Python with logistic regression. I am also using cross-validation to test my prediction accuracy.

I wanted to test whether I get similar results when I do the cross-validation myself. Here are the results:

import numpy as np
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# X is my feature matrix. (m x p)
# y is my label vector. (m x 1)

# Using the cross_validation.cross_val_score() function:
classifier = LogisticRegression()
scores1 = cross_validation.cross_val_score(classifier, X, y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores1.mean(), scores1.std() * 2))

# Doing it "manually":
scores2 = np.array([])
classifier = LogisticRegression()
for i in range(10):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, y, test_size=0.1, random_state=i)
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    scores2 = np.append(scores2, score)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores2.mean(), scores2.std() * 2))

# This prints:
# Accuracy: 0.72 (+/- 0.47)
# Accuracy: 0.58 (+/- 0.52) 

My X and y are fairly large, so I was not expecting such a big difference in the results. Is this difference entirely due to the randomness of the process, or am I missing something in my code?

Here is the documentation page for cross_validation.cross_val_score():

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html

Here is the documentation page for cross_validation.train_test_split():

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html


1 Answer

2
votes

train_test_split uses a randomized training and test set split, while cross_val_score(cv=10) uses stratified k-fold cross-validation.
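If you want to check that, you can pass the folds explicitly. Here is a minimal sketch using the same sklearn.cross_validation module your links point to (X, y, and the classifier are assumed to be set up as in your question; exact keyword names may differ slightly across versions):

from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()

# For a classifier, cv=10 is shorthand for 10 stratified, non-shuffled folds.
skf = cross_validation.StratifiedKFold(y, n_folds=10)
scores_skf = cross_validation.cross_val_score(classifier, X, y, cv=skf)

# scores_skf should reproduce cross_val_score(classifier, X, y, cv=10),
# because both iterate over the same deterministic folds.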

Try using cv=ShuffleSplit(test_size=0.1). That should give you more similar results. It will not use the same random seeding you did, so the scores may still differ, but it would be surprising if they fell outside each other's standard deviation.
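In the sklearn.cross_validation API you are using, ShuffleSplit also takes the number of samples as its first argument, so the call would look roughly like this (a sketch, not tested against your data; the n_iter keyword may vary by version):

from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()

# Ten independent random 90/10 splits -- this mirrors your manual loop
# much more closely than stratified k-fold does.
ss = cross_validation.ShuffleSplit(len(y), n_iter=10, test_size=0.1,
                                   random_state=0)
scores3 = cross_validation.cross_val_score(classifier, X, y, cv=ss)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores3.mean(), scores3.std() * 2))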