1
votes

I want to train a random forest model on a dataset with large missingness. I am aware of the 'standard method', where we impute missing data in the training set, use the same imputation rules to impute the test set, then train a random forest model on the imputed training set and use the same model to predict on the test set (potentially doing it with multiple imputation).

What I want to understand is the difference to the following method which I would like to use:

Subset the dataset according to missing patterns. Train random forest models for each of the missing patterns. Use the random forest model trained on missing pattern A to predict data from the test set with missing pattern A. Use the model trained on pattern B to predict data from the test set with pattern B etc.

What is the name for this method? What are the statistical advantages or disadvantages of the two methods? I would very much appreciate if someone could direct me to some literature on the second method, or the comparison of the two.

1
What exactly do you mean with 'missing patterns' ?Steffen Moritz
Combinations of missing variables: Let's say we have variables a,b,c,d. For example 10 records could have missingess in a and b. This is the first missing pattern. 7 records have missingess in a and c. This is another missing pattern... etcBrigitte

1 Answers

0
votes

The difference in methods is in prediction capability.

If you will train different models according to different missing patterns it will be trained on a lower amount of the data (due to missing pattern separation) and will be used to predict only the corresponding test set. Using this approach you can easily miss common patterns in your data for all of your dataset, which otherwise (using all the data) you would detect.

It still heavily depends on your particular case and your data. The good test that will check if your models trained due to particular missing patterns generalize well will be taking another missing pattern dataset, do simple and fast imputation in it (mean/mode/median e.t.c) and check the difference in the metric.

In my opinion, this approach sounds a little extreme as you are voluntarily cutting your train dataset into much smaller parts, than it could be. Maybe, it could perform better on large amounts of data, where your train dataset reduction doesn't hurt your model performance much.

About the articles - I don't know any articles, that compare these two approaches, but can suggest some good ones about various "standard "imputation approaches: