I'm evaluating several machine learning models on a binary classification problem and getting strange results when I add one non-binary feature.
My dataset consists of tweets and some other values related to them, so the main feature matrix is sparse (5000 columns), generated by running scikit-learn's TfidfVectorizer on the tweets and then applying SelectKBest feature selection.
I have two other features I want to add, both of which are 1-column dense matrices. I convert them to sparse and use scipy's hstack function to append them to the main feature matrix. The first of these features is binary, and when I add just that one everything works and I get accuracies of ~60%. However, the second feature takes integer values, and adding it changes the results very differently for each classifier.
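For reference, here is a minimal sketch of the stacking step described above. It uses toy data; the variable names, the chi2 score function and the small k are assumptions made for illustration, not the actual pipeline (which keeps 5000 columns).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy stand-ins for the real data (hypothetical values).
tweets = [
    "great game tonight",
    "terrible service again",
    "loving the new phone",
    "worst day ever",
]
labels = np.array([1, 0, 1, 0])
binary_feature = np.array([1, 0, 1, 1])          # the 0/1 feature
integer_feature = np.array([250, 3, 1200, 17])   # the integer-valued feature

# Tf-idf on the tweets, then keep the k best columns (k=5000 in the real setup).
tfidf = TfidfVectorizer().fit_transform(tweets)
k = min(5, tfidf.shape[1])
text_features = SelectKBest(chi2, k=k).fit_transform(tfidf, labels)

# Turn the two dense columns into a sparse (n_samples, 2) block and append it.
extra = csr_matrix(np.column_stack([binary_feature, integer_feature]))
X = hstack([text_features, extra]).tocsr()

print("combined shape:", X.shape)
print("largest tf-idf value:", text_features.max())
print("largest value in the integer column:", X[:, -1].max())
```

The last two prints are only there to show the value range of each block side by side.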
I am testing Logistic Regression, SVM (rbf), and Multinomial Naive Bayes. When I add the final feature, the SVM accuracy increases to 80%, but Logistic Regression now always predicts the same class, and MNB is also heavily skewed towards that class.
SVM confusion matrix:
[[13112  3682]
 [ 1958  9270]]
MNB confusion matrix:
[[13403  9803]
 [ 1667  3149]]
LR confusion matrix:
[[15070 12952]
 [    0     0]]
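A comparison like the one above can be sketched roughly as follows. This runs on random synthetic data, so the numbers it prints mean nothing; the hyperparameters, the train/test split and all variable names are assumptions for illustration, not the settings used for the matrices above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n_samples = 200
y = rng.randint(0, 2, size=n_samples)

# Synthetic stand-in for the tf-idf block: sparse, values in [0, 1).
text_like = sparse_random(n_samples, 50, density=0.1, random_state=rng, format="csr")
binary_feature = rng.randint(0, 2, size=(n_samples, 1))
integer_feature = rng.randint(0, 1000, size=(n_samples, 1))

X = hstack([text_like, csr_matrix(binary_feature), csr_matrix(integer_feature)]).tocsr()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "SVM (rbf)": SVC(kernel="rbf"),
    "MNB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Rows are true labels, columns are predicted labels.
    results[name] = confusion_matrix(y_test, y_pred, labels=[0, 1])
    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    print(results[name])
```

Note that scikit-learn's confusion_matrix(y_true, y_pred) puts the true labels on the rows and the predictions on the columns, so swapping the two arguments transposes the matrix.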
Can anyone explain why this happens? I don't understand how this one extra feature can make two of the classifiers effectively useless while improving the third so much. Thanks!