How can I get the probability density function from a regression random forest?

Question

I am using a random forest for a regression problem: predicting the label values Test-Y for a given set Test-X (new feature values). The model was trained on Train-X (features) and Train-Y (labels). The "randomForest" package in R serves me very well for predicting the numerical values of Test-Y, but that is not all I want.

Instead of only a single number, I want the random forest to produce a probability density function. I have searched for a solution for several days, and here is what I have found so far:

"randomForest" doesn't produce probabilities for regression, only for classification (via "predict" with type = "prob").
"quantregForest" provides a nice way to compute and visualize prediction intervals, but still not the probability density function!

Any other thoughts on this?

quantregForest does provide a probability density, it's the ecdf you can predict. — catastrophic-failure
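To make the comment above concrete, here is a minimal sketch of reading an estimated conditional distribution off a quantile regression forest. It assumes a version of quantregForest whose predict method takes a `what` argument (older releases used a differently named argument), and the choice of predictors, the seed, and the 1% quantile grid are illustrative assumptions, not part of the original question. As I understand the package documentation, `what` can also be a function such as ecdf, which is not shown here.

```r
library("ggplot2")          # only used for the mpg data set
library("quantregForest")

set.seed(42)                # arbitrary seed so the sketch is reproducible

# Numeric predictors only, to keep the example simple (an assumption)
x <- as.data.frame(mpg[, c("displ", "cyl")])
y <- as.numeric(mpg$cty)

qrf <- quantregForest(x = x, y = y)

# Ask for a fine grid of conditional quantiles for the first car
probs <- seq(0.01, 0.99, by = 0.01)
q <- as.numeric(predict(qrf, newdata = x[1, , drop = FALSE], what = probs))

# Plotting probability against quantile traces the estimated conditional CDF
plot(q, probs, type = "s", xlab = "cty", ylab = "Estimated P(Y <= y | x)")
```

This gives a distribution function rather than a density; a density would still have to be derived from it, for example by smoothing.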

user1808924 · Accepted Answer · 2016-02-19T19:16:55

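Please see the predict.all parameter of the predict.randomForest function. With predict.all = TRUE, predict returns the prediction of every individual tree (in the "individual" component) in addition to the aggregated forest prediction.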

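library("ggplot2")       # provides the mpg data set
library("randomForest")

data(mpg)
# In recent ggplot2 versions trans is a character column; convert it to a
# factor so that a single-row newdata keeps all of the training levels.
mpg = as.data.frame(mpg)
mpg$trans = factor(mpg$trans)
rf = randomForest(cty ~ displ + cyl + trans, data = mpg)

# Predict the first car in the dataset, keeping each tree's prediction
pred = predict(rf, newdata = mpg[1, ], predict.all = TRUE)
# pred$individual holds one prediction per tree (500 by default)
hist(pred$individual)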

The histogram of 500 "elementary" predictions looks like this:

[Image: histogram of pred$individual]
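If a smooth curve is wanted instead of a histogram, one option is to pass the per-tree predictions to a kernel density estimator. The following is only a sketch: the seed, the reduced set of numeric predictors, and the names tree_preds and pdf_hat are my own illustrative choices, and the default Gaussian kernel and bandwidth of density() are assumptions rather than tuned settings.

```r
library("ggplot2")       # only used for the mpg data set
library("randomForest")

set.seed(42)             # arbitrary seed so the sketch is reproducible

cars <- as.data.frame(mpg)
rf <- randomForest(cty ~ displ + cyl, data = cars)

# One prediction per tree for the first car
pred <- predict(rf, newdata = cars[1, ], predict.all = TRUE)
tree_preds <- as.numeric(pred$individual)

# Kernel density estimate over the per-tree predictions
dens <- density(tree_preds)
plot(dens, main = "Kernel density of per-tree predictions")

# Wrap the estimate as a function that can be evaluated at any value
pdf_hat <- approxfun(dens$x, dens$y, yleft = 0, yright = 0)
pdf_hat(pred$aggregate)  # estimated density at the forest's own prediction
```

Keep in mind that each tree predicts a leaf mean, so this curve describes how much the trees disagree about the mean, which is typically narrower than the conditional distribution of the response itself.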
