
I was running a random forest classification model and initially split the data into train (80%) and test (20%). However, the predictions had too many false positives, which I suspected was because there was too much noise in the training data, so I decided to split the data differently. Here's how I did it.

Since I thought the high false positive rate was due to noise in the training data, I rebuilt the training set to have an equal number of rows for each target class. For example, if I have 10,000 rows and the target variable is split 8,000 (0) and 2,000 (1), I made the training data a total of 4,000 rows: 2,000 (0) and 2,000 (1), so that the training data now carries more signal.
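For concreteness, here is a minimal sketch of one reading of that balancing step, assuming a hypothetical pandas DataFrame `df` with a 0/1 `target` column; the majority class is randomly undersampled in the training data only, so the test set keeps the original class distribution:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame `df` with a binary `target` column,
# e.g. 8,000 rows of class 0 and 2,000 rows of class 1.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Undersample the majority class (0) in the training data only,
# down to the size of the minority class (1).
ones = train_df[train_df["target"] == 1]
zeros = train_df[train_df["target"] == 0].sample(n=len(ones), random_state=42)
balanced_train = pd.concat([ones, zeros]).sample(frac=1, random_state=42)

X_train = balanced_train.drop(columns="target")
y_train = balanced_train["target"]
```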

When I tried this new splitting method, the model predicted much better, increasing the recall for the positive class from 14% to 70%.
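For reference, per-class recall (and precision) can be read off scikit-learn's classification report, assuming a fitted model `clf` and a held-out `X_test`/`y_test`:

```python
from sklearn.metrics import classification_report

# Shows precision, recall and F1 for each class separately,
# which is more informative than overall accuracy here.
print(classification_report(y_test, clf.predict(X_test)))
```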

I would love to hear your feedback on whether I am doing anything wrong here. I am concerned that I am making my training data biased.


2 Answers


When you have an unequal number of data points in each class in the training set, the baseline (random prediction) changes.

By noisy data, I think you mean that the number of training points for one class is much larger than for the other. This is not really called noise; it is actually bias.

For example: you have 10,000 data points in the training set, 8,000 of class 1 and 2,000 of class 0. I can predict class 1 all the time and already get 80% accuracy. This induces a bias, and the baseline for 0-1 classification will not be 50%.
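To make this concrete, a majority-class baseline can be checked with scikit-learn's `DummyClassifier`, assuming an `X_train`/`y_train`/`X_test`/`y_test` split of such data:

```python
from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# On data that is 80% one class, this scores around 0.80 accuracy
# while never predicting the minority class at all.
print(baseline.score(X_test, y_test))
```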

To remove this bias, you can either intentionally balance the training set, as you did, or change the error function by weighting each class inversely proportionally to its number of points in the training set.
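The second option is built into scikit-learn; a minimal sketch, assuming the usual `X_train`/`y_train` from the question:

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" weights each class inversely proportionally
# to its frequency, so minority-class errors count more heavily and
# the full (imbalanced) training set can be kept as-is.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)
```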


Actually, what you did is right, and the process is similar in spirit to "stratified sampling". In your first model, where recall for the positive class was very low, the model did not learn enough of the correlations between the features and the target for the positive class (1), and it may also have somewhat over-fitted to the negative class. This is sometimes described as a high-bias, high-variance situation.

"Stratified sampling" is nothing but when you are extracting a sample data from a big population,you make sure that all classes will have some what approximately equal proportion to make the model's training assumptions more accurate and reliable.

In the second case, the model was able to learn the relationships between the features and the target, and the characteristics of the positive and negative classes were well distinguishable. Eliminating noise is part of data preparation and should obviously be done before feeding the data into a model.