
I'm evaluating some machine learning models for a binary classification problem, and encountering weird results when adding one non-binary feature.

My dataset consists of tweets and some other values related to them, so the main feature vector is a sparse matrix (5000 columns) generated using scikit-learn's TfidfVectorizer on the tweets, followed by SelectKBest feature selection.

I have two other features I want to add, each a 1-column dense matrix. I convert them to sparse and use scipy's hstack function to append them to the main feature vector. The first of these features is binary, and when I add just that one all is well and I get accuracies of ~60%. However, the second feature is integer-valued, and adding it causes wildly varying results.
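For reference, the feature assembly described above can be sketched as follows (the matrix sizes and values here are illustrative stand-ins, not the real data):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for the TF-IDF matrix: n_samples x 5000, sparse
X_text = csr_matrix(np.random.rand(6, 5000))

# Stand-in for the extra 1-column integer feature
extra = np.array([[0], [3], [1], [7], [2], [5]])

# hstack keeps the result sparse when all inputs are sparse
X = hstack([X_text, csr_matrix(extra)]).tocsr()
print(X.shape)  # (6, 5001)
```

Note that the appended integer column lives on a very different scale from the TF-IDF values (which are typically in [0, 1]), which matters for what follows.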

I am testing Logistic Regression, SVM (RBF kernel), and Multinomial Naive Bayes. When I add the integer feature, the SVM accuracy increases to 80%, but Logistic Regression now always predicts the same class, and MNB is also very heavily skewed towards that class.
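The evaluation setup amounts to something like the sketch below, using toy random data in place of the real feature matrix (the names `X` and `y` are stand-ins for the hstacked features and tweet labels):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Toy stand-ins: 200 samples, 50 non-negative sparse features, binary labels
rng = np.random.RandomState(0)
X = sparse_random(200, 50, density=0.1, random_state=rng, format="csr")
y = rng.randint(0, 2, 200)

# MultinomialNB requires non-negative features, which TF-IDF satisfies
for clf in (LogisticRegression(), SVC(kernel="rbf"), MultinomialNB()):
    clf.fit(X, y)
    print(type(clf).__name__)
    print(confusion_matrix(y, clf.predict(X)))
```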

SVM confusion matrix
[[13112  3682]
 [ 1958  9270]]

MNB confusion matrix
[[13403  9803]
 [ 1667  3149]]

LR confusion matrix
[[15070 12952]
 [    0     0]]

Can anyone explain why this happens? I don't understand how one extra feature can make two of the classifiers effectively useless while improving the other one so much. Thanks!

1
Are the classes for that feature highly imbalanced? - juanpa.arrivillaga
No the classes are pretty balanced for all data that I've tested it on, and it always gives results like this - eb94

1 Answer


Sounds like your extra feature has a non-linear relationship with the class. Logistic Regression assumes a linear relationship between the features and the log-odds of the class, and Multinomial Naive Bayes makes similarly restrictive assumptions about how each feature contributes. An SVM with an RBF kernel, on the other hand, can learn a non-linear decision boundary; intuitively, it can find a "cut-off" value for your variable and split on it. If you still want to use LR or NB, you could try transforming the variable to make its relationship with the class more linear, or convert it into a binary indicator based on such a threshold, and you might improve your models' performance.
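As a minimal sketch of both suggestions (the values and the threshold of 10 here are purely illustrative, and the feature is a hypothetical stand-in for your integer column):

```python
import numpy as np

# Hypothetical integer-valued feature, e.g. some count attached to each tweet
counts = np.array([[0], [3], [120], [7], [4500], [1]])

# Option 1: compress the scale so the feature is closer in magnitude
# to the TF-IDF columns and its effect is closer to linear
log_counts = np.log1p(counts)

# Option 2: binarize at a chosen cut-off (10 is an arbitrary example;
# you would pick it from the data or via cross-validation)
binary = (counts > 10).astype(int)
print(binary.ravel())  # [0 0 1 0 1 0]
```

Either transformed column can then be hstacked onto the TF-IDF matrix exactly as before.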

Take a look at https://stats.stackexchange.com/questions/182329/how-to-know-whether-the-data-is-linearly-separable for some further reading.