I want to build a classifier that labels text as ham or spam. I have train and test data for each category: the training set has 8000 sentences per category, and the test set has 2000 sentences per category.

X_train looks like this: ['please, call me asap!', 'watch out the new sales!', 'hello jim can we talk?', 'only today you can buy this', "don't miss our offer!"]

y_train looks like this: [1 0 1 0 0], where 1 = ham and 0 = spam.

X_test and y_test have the same structure.

This is a snippet of my code:

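from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# classifier can be LogisticRegression, MultinomialNB, RandomForestClassifier, DecisionTreeClassifier
text_clf = Pipeline([('vect', CountVectorizer()),    # raw text -> token counts
                     ('tfidf', TfidfTransformer()),  # token counts -> TF-IDF weights
                     ('clf', classifier),
                    ])
model = text_clf.fit(X_train, y_train)  # fit() returns the fitted pipeline itself
y_predict = model.predict(X_test)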

And these are the metrics that I measure:

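from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print(accuracy_score(y_test, y_predict))
print(f1_score(y_test, y_predict, average="weighted"))
print(recall_score(y_test, y_predict, pos_label=1, average="binary"))  # recall of the ham class only
print(precision_score(y_test, y_predict, average="weighted"))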

If I don't apply any "optimization" (removing stop words, removing punctuation, stemming, lemmatizing), every metric comes out at around 95%. If I apply those optimizations, accuracy, F1 score and precision drop drastically to 50-60%, while recall stays at about 95%.
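
Note that the recall call is computed for the ham class only (pos_label=1, average="binary"), while F1 and precision are weighted over both classes. Below is a minimal sketch, using made-up labels rather than the real data, of how these numbers can diverge when a model labels almost everything as ham. Whether that is what happens here is an assumption that would need checking against the actual predictions, for example with a confusion matrix.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical balanced test set: 10 ham (1) and 10 spam (0) messages.
y_test = [1] * 10 + [0] * 10

# Hypothetical predictions from a model that over-predicts ham:
# all 10 ham messages are correct, but 9 of the 10 spam messages are labeled ham.
y_predict = [1] * 10 + [1] * 9 + [0] * 1

accuracy = accuracy_score(y_test, y_predict)
recall_ham = recall_score(y_test, y_predict, pos_label=1, average="binary")
recall_weighted = recall_score(y_test, y_predict, average="weighted")
precision_weighted = precision_score(y_test, y_predict, average="weighted")
f1_weighted = f1_score(y_test, y_predict, average="weighted")

print("accuracy          :", accuracy)        # 11 of 20 correct -> 0.55
print("recall (ham only) :", recall_ham)      # every ham message found -> 1.0
print("recall (weighted) :", recall_weighted)
print("precision (wtd.)  :", precision_weighted)
print("f1 (weighted)     :", f1_weighted)
```

In this toy case the ham-only recall stays perfect while accuracy sits near chance, so reporting recall with the same averaging as the other metrics makes the four numbers easier to compare.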

Why is this happening? Where is my mistake? Am I calculating these metrics correctly, or is this normal behavior?

Usually "optimization" means trading one thing for another, e.g. accuracy for speed of training. Were you expecting something for nothing? - balmy
If I call them features, will you try to tell me why this behavior occurs? - Mr. Wizard
I would try adding the optimizations one at a time, to see what their effects are individually. If there's a particular one that's causing this behavior, take a look at what it's doing. - Silenced Temporarily
I added them one after another and obtained the same results. Each of them decreases those metrics. - Mr. Wizard
Please add the complete code (with and without optimization) with dataset. - Vivek Kumar
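
The one-at-a-time comparison suggested in the comments could be sketched roughly as follows. The data is a made-up toy corpus, strip_punctuation is a hypothetical helper, and MultinomialNB is just one of the classifiers the question lists; stemming and lemmatization are left out because they need an extra library. Each variant changes only the CountVectorizer settings, so any difference in score comes from that one step.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data (1 = ham, 0 = spam), only so the sketch runs end to end.
X_train = ["please, call me asap!", "watch out the new sales!",
           "hello jim can we talk?", "only today you can buy this",
           "don't miss our offer!", "can you call me after lunch?"]
y_train = [1, 0, 1, 0, 0, 1]
X_test = ["jim, please call me", "buy our new offer today!"]
y_test = [1, 0]


def strip_punctuation(text):
    """Hypothetical preprocessor: lowercase and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# One CountVectorizer configuration per preprocessing step under test.
variants = {
    "baseline": {},
    "stop_words": {"stop_words": "english"},
    "strip_punctuation": {"preprocessor": strip_punctuation},
}

results = {}
for name, vect_kwargs in variants.items():
    pipeline = Pipeline([
        ("vect", CountVectorizer(**vect_kwargs)),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ])
    pipeline.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipeline.predict(X_test))

for name, score in results.items():
    print(f"{name}: accuracy = {score:.2f}")
```

Keeping the preprocessing inside the pipeline means the same transformation is applied to X_train and X_test; a mismatch between the two is one thing worth ruling out when scores collapse after preprocessing.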

1 Answer

I just figured out what was wrong: underfitting. I performed cross-validation:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')

and now everything is fine; I obtain the results I was expecting.
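
For reference, a self-contained sketch of that cross-validation call on a pipeline like the one in the question. The corpus below is made up, and MultinomialNB is an arbitrary pick from the classifiers listed in the question. cross_val_score fits a fresh clone of the estimator on each fold and returns one score per fold, so it is an evaluation step and leaves the estimator passed in unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 10 ham and 10 spam messages, so that cv=10 leaves
# one message of each class in every stratified test fold.
names = ["jim", "ann", "bob", "eve", "tom", "sue", "max", "amy", "ben", "zoe"]
items = ["shoes", "phones", "watches", "bags", "books",
         "toys", "tickets", "games", "laptops", "cameras"]
X_train = ([f"hello {n} can we talk later" for n in names]
           + [f"buy cheap {i} today special offer" for i in items])
y_train = [1] * 10 + [0] * 10  # 1 = ham, 0 = spam

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# One accuracy value per fold; the mean and spread summarize the estimate.
scores = cross_val_score(text_clf, X_train, y_train, cv=10, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())
print("std deviation  :", scores.std())
```

Looking at the spread across folds, and not only the mean, gives a sense of how stable the estimate is.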