3
votes

I have a skewed dataset (5,000,000 positive examples and only 8,000 negative ones, binary classification), so I know accuracy is not a useful evaluation metric. I know how to calculate precision and recall mathematically, but I am unsure how to implement them in Python.

When I train the model on all the data, I get 99% accuracy overall but 0% accuracy on the negative examples (i.e., the model classifies everything as positive).

I have built my current model in PyTorch with criterion = nn.CrossEntropyLoss() and optimiser = optim.Adam().

So, my question is: how do I incorporate precision and recall into my training to produce the best model possible?

Thanks in advance


2 Answers

1
vote

Precision, recall, F1 score, and other metrics are usually imported from the scikit-learn library in Python.

link: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
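For example, here is a minimal sketch using sklearn.metrics; the y_true and y_pred arrays below are toy stand-ins for your labels and model predictions, and pos_label=0 scores the minority (negative) class directly:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy stand-ins for your ground-truth labels and model predictions
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0]

# pos_label=0 scores the minority (negative) class directly
print("precision:", precision_score(y_true, y_pred, pos_label=0))
print("recall:   ", recall_score(y_true, y_pred, pos_label=0))
print("F1:       ", f1_score(y_true, y_pred, pos_label=0))
print(confusion_matrix(y_true, y_pred))
```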

Regarding your classification task, the number of positive training samples simply eclipses the number of negative samples. Try training with a reduced number of positive samples or generating more negative samples. I am not sure a deep neural network will give you an optimal result given this degree of class skew.
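For instance, a minimal random-undersampling sketch; the X and y arrays here are toy data standing in for your own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for your data: X (features), y (labels, 1 = majority class)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.99).astype(int)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep every minority sample plus an equal-sized random subset of the majority
keep_pos = rng.choice(pos_idx, size=len(neg_idx), replace=False)
balanced = rng.permutation(np.concatenate([keep_pos, neg_idx]))

X_bal, y_bal = X[balanced], y[balanced]
print(np.bincount(y_bal))  # equal class counts
```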

Negative samples can be generated using the Synthetic Minority Over-sampling Technique (SMOTE). This link is a good place to start: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
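A minimal sketch using the imbalanced-learn implementation of SMOTE (assuming that package is installed; the toy data stands in for your own):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for your real X and y (class 0 is ~1%)
X, y = make_classification(n_samples=10_000, weights=[0.01, 0.99], random_state=0)

smote = SMOTE(random_state=0)
X_res, y_res = smote.fit_resample(X, y)

print("before:", Counter(y))      # heavily skewed
print("after: ", Counter(y_res))  # balanced via synthetic minority samples
```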

Try simple models such as logistic regression or random forests first and check whether the F1 score of the model improves.
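Something like the following baseline comparison, again on toy data; class_weight="balanced" is one reasonable starting point here, not a tuned choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for your real X and y (class 0 is ~1%)
X, y = make_classification(n_samples=10_000, weights=[0.01, 0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for model in (
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    RandomForestClassifier(class_weight="balanced", random_state=0),
):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Score the minority class explicitly rather than overall accuracy
    print(type(model).__name__, f1_score(y_test, preds, pos_label=0))
```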

1
vote

To add to the other answer, some classifiers have a parameter called class_weight which lets you modify the loss function. By penalizing wrong predictions on the minority class more heavily, you can train your classifier to learn to predict both classes. For a PyTorch-specific answer, you can refer to this link.
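In PyTorch, the same idea is the weight argument of nn.CrossEntropyLoss. A minimal sketch, assuming class index 0 is the negative (minority) class; the inverse-frequency weighting is one common heuristic, not the only option:

```python
import torch
import torch.nn as nn

# Inverse-frequency class weights; the counts mirror the question
# (index 0 = negative/minority, index 1 = positive/majority)
class_counts = torch.tensor([8_000.0, 5_000_000.0])
weights = class_counts.sum() / (2 * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

# Toy batch: 4 samples, 2 classes
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 1])
print(criterion(logits, targets).item())
```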

As mentioned in the other answer, over- and undersampling strategies can be used. If you are looking for something better, take a look at this paper.
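For oversampling directly inside a PyTorch training loop, one common approach (not from the paper) is torch.utils.data.WeightedRandomSampler; a minimal sketch on toy data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy data standing in for your dataset: class 0 is the rare negative class
X = torch.randn(1_000, 10)
y = (torch.rand(1_000) > 0.02).long()

# Weight each sample inversely to its class frequency so minority
# examples are drawn far more often during an epoch
class_counts = torch.bincount(y, minlength=2).float()
sample_weights = 1.0 / class_counts[y]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=64, sampler=sampler)

xb, yb = next(iter(loader))
print("minority fraction in batch:", (yb == 0).float().mean().item())
```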