Using scikit-learn, I am doing supervised learning in Python with logistic regression, and I am using cross validation to test my prediction accuracy.
I wanted to check whether I get similar results when I do the cross validation myself. Here are the results:
# X is my feature matrix (m x p).
# y is my label vector (m x 1).
import numpy as np
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# Using the cross_validation.cross_val_score() function:
classifier = LogisticRegression()
scores1 = cross_validation.cross_val_score(classifier, X, y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores1.mean(), scores1.std() * 2))
# Doing it "manually":
scores2 = np.array([])
classifier = LogisticRegression()
for i in range(10):
    # Draw a fresh random 90/10 split on each iteration.
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, y, test_size=0.1, random_state=i)
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    scores2 = np.append(scores2, score)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores2.mean(), scores2.std() * 2))
# This prints:
# Accuracy: 0.72 (+/- 0.47)
# Accuracy: 0.58 (+/- 0.52)
My X and y are fairly large, so I was not expecting a big difference between the results. Is this difference entirely due to the randomness of the process, or am I missing something in my code?
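In case the splitting strategy is the culprit: as far as I understand the docs, cross_val_score with cv=10 on a classifier uses StratifiedKFold internally, while my loop draws 10 independent, non-stratified 90/10 splits, which is essentially what ShuffleSplit does. Below is an untested sketch of how I would compare like with like (it assumes X and y are NumPy arrays, so they can be indexed by fold indices):

# Reproduce the manual loop through cross_val_score by passing a
# ShuffleSplit iterator instead of the default StratifiedKFold:
shuffle_cv = cross_validation.ShuffleSplit(len(y), n_iter=10,
                                           test_size=0.1, random_state=0)
scores3 = cross_validation.cross_val_score(classifier, X, y, cv=shuffle_cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores3.mean(), scores3.std() * 2))

# Reproduce cross_val_score manually with explicit stratified folds:
scores4 = np.array([])
for train_index, test_index in cross_validation.StratifiedKFold(y, n_folds=10):
    classifier.fit(X[train_index], y[train_index])
    scores4 = np.append(scores4, classifier.score(X[test_index], y[test_index]))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores4.mean(), scores4.std() * 2))

If scores3 ends up close to scores2 and scores4 close to scores1, the gap would come from the split strategy rather than from a bug in my loop.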
Here is the documentation page for cross_validation.cross_val_score():
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html
Here is the documentation page for cross_validation.train_test_split():
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html