
I am new to data science and would like to ask for help with model selection.

I built 8 models to predict salary from years of experience, position name, and location, and compared them by RMSE. However, I am not sure which model I should select. (I lean toward model 8: in my tests random forest gave better results than regression, so I then used the whole data set to fit the final version. It is harder to interpret than regression coefficients, though.) Which model would you choose, and why? And in practice, do data scientists follow a process like this, or is there an automated way to handle it?

1 RMSElm1: model: linear regression; data: train 80%, test 20%; no imputation = 22067.58

2 RMSElm2: model: linear regression; data: train 80%, test 20%; imputation of some locations that I think indicate a similar salary level = 22115.64

3 RMSElm3: model: linear regression + stepwise selection; data: train 80%, test 20%; no imputation = 22081.06

4 RMSEdeep1: model: deep learning (H2O package, activation = 'Rectifier', hidden = c(5,5), epochs = 100); data: train 80%, test 20%; no imputation = 16265.13

5 RMSErf1: model: random forest (ntree = 10); data: train 80%, test 20%; no imputation = 14669.92

6 RMSErf2: model: random forest (ntree = 500); data: train 80%, test 20%; no imputation = 14669.92

7 RMSErf3: model: random forest (ntree = 10); data: 10-fold cross-validation; no imputation = 14440.82

8 RMSErf4: model: random forest (ntree = 10); data: all data; no imputation = 13532.74


1 Answer


In regression problems, MSE or RMSE measures how well your model is doing, and lower is better. So pick the model with the lowest MSE or RMSE and check it on test data. Ensemble methods often give the best results, and XGBoost is often used in competitions.
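For reference, here is a minimal sketch of how RMSE is computed, written in plain Python rather than R. The salary values are made up for illustration and are not taken from your data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical salaries and predictions, for illustration only.
actual = [50000, 62000, 58000, 71000]
predicted = [52000, 60000, 59000, 68000]

print(rmse(actual, predicted))  # roughly 2121.32
```

RMSE is in the same units as the target, so a value of about 2121 here means the predictions are off by roughly 2,100 salary units on average (with larger errors weighted more heavily).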

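Watch out for overfitting, where RMSE is very low on the training data but high on the test data. An RMSE computed on the same data the model was fitted on (which may be the case for your model 8) is not comparable with RMSEs measured on held-out data. For this reason, it is good practice to use cross-validation.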
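As a rough sketch of what k-fold cross-validation does, the plain-Python example below compares two toy models on synthetic data. Everything in it is an assumption for illustration: the data are randomly generated, and `fit_mean`, `fit_line`, and `k_fold_rmse` are hypothetical helper names, not functions from any library.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Simple least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_rmse(fit, xs, ys, k=5, seed=0):
    """Average RMSE over k folds; each fold is held out exactly once."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in held_out]
        scores.append(rmse([ys[i] for i in held_out], preds))
    return sum(scores) / len(scores)


# Synthetic data: salary grows with years of experience, plus noise.
rng = random.Random(42)
years = [rng.uniform(0, 20) for _ in range(100)]
salary = [30000 + 2500 * x + rng.gauss(0, 5000) for x in years]

cv_mean = k_fold_rmse(fit_mean, years, salary)
cv_line = k_fold_rmse(fit_line, years, salary)
print("mean baseline CV RMSE:", round(cv_mean, 2))
print("linear model CV RMSE:", round(cv_line, 2))
```

Because every score is computed on data the model did not see during fitting, the averaged RMSEs can be compared across models on equal terms.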

You might also want to read this: https://stats.stackexchange.com/questions/56302/what-are-good-rmse-values