3
votes

I am using a random forest for a regression problem: predicting the label values Test-Y for a given set Test-X (new feature values). The model was trained on Train-X (features) and Train-Y (labels). The R package "randomForest" serves me very well for predicting the numerical values of Test-Y, but that is not all I want.

Instead of only a single number, I want the random forest to produce a probability density function. I have searched for a solution for several days, and here is what I have found so far:

  1. "randomForest" doesn't produce probabilities for regression, only for classification (via "predict" with type = "prob").

  2. Using "quantregForest" provides a nice way to make and visualize prediction intervals. But still not the probability density function!

Any other thoughts on this?

2
quantregForest does provide the predictive distribution: it's the ecdf you can predict (strictly a cumulative distribution function, from which a density can be derived). – catastrophic-failure

2 Answers

4
votes

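Please see the predict.all parameter of the predict.randomForest function. With predict.all = TRUE, it returns the prediction of every individual tree in addition to the aggregated prediction.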

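library("ggplot2")       # only needed for the mpg dataset
library("randomForest")

data(mpg)
mpg = as.data.frame(mpg)       # work with a plain data frame
mpg$trans = factor(mpg$trans)  # convert the character column to a factor

set.seed(42)  # make the forest reproducible
rf = randomForest(cty ~ displ + cyl + trans, data = mpg)

# Predict the first car in the dataset, keeping every tree's prediction
pred = predict(rf, newdata = mpg[1, ], predict.all = TRUE)
hist(pred$individual)  # one prediction per tree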

The histogram of the 500 "elementary" (per-tree) predictions looks like this:

[image: histogram of pred$individual]
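To go from a histogram to a smooth curve, the per-tree predictions can be passed to a kernel density estimator. The following is a minimal sketch, assuming the same mpg model as above and the default bandwidth of base R's density(); the names tree_preds and dens are illustrative only. Note that the spread across trees reflects how much the tree-level estimates disagree, so it may not match the full conditional distribution of the response.

```r
library(randomForest)

# Same data as in the answer, as a plain data frame with a factor column
mpg <- as.data.frame(ggplot2::mpg)
mpg$trans <- factor(mpg$trans)

set.seed(42)  # reproducible forest
rf <- randomForest(cty ~ displ + cyl + trans, data = mpg)

# Keep every tree's prediction for the first car
pred <- predict(rf, newdata = mpg[1, ], predict.all = TRUE)
tree_preds <- as.numeric(pred$individual)  # one value per tree

# Kernel density estimate over the per-tree predictions (default bandwidth)
dens <- density(tree_preds)
plot(dens, main = "Density of per-tree predictions")
abline(v = pred$aggregate, lty = 2)  # aggregated forest prediction
```

The bandwidth (the bw argument of density()) controls how smooth the curve is, so it is worth trying a few values before reading much into the shape.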

0
votes

You can also use quantregForest with a very fine grid of quantiles, turn them into a cumulative distribution function (cdf) with the R function ecdf, and then turn that cdf into a density estimate with a kernel density estimator.
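The steps above could be sketched roughly as follows. This assumes a quantregForest version whose predict method takes the quantile levels through the what argument (older releases used a different argument name), and it reuses the mpg data with two numeric predictors purely for illustration; probs, q, cdf and dens are hypothetical names.

```r
library(quantregForest)

# Illustrative data: two numeric predictors from mpg, response cty
mpg <- as.data.frame(ggplot2::mpg)
x <- mpg[, c("displ", "cyl")]
y <- as.numeric(mpg$cty)

set.seed(42)  # reproducible forest
qrf <- quantregForest(x = x, y = y)

# Fine grid of quantile levels for one new observation (here: the first car)
probs <- seq(0.01, 0.99, by = 0.01)
q <- as.numeric(predict(qrf, newdata = x[1, , drop = FALSE], what = probs))

# Equally spaced quantiles behave like a sample from the conditional
# distribution: ecdf() gives the cdf, density() a smoothed density
cdf <- ecdf(q)
dens <- density(q)

plot(dens, main = "Estimated conditional density")
```

Because the quantile levels are equally spaced, treating the predicted quantiles as a sample is what lets ecdf() and density() be applied directly; a coarser grid would give a rougher estimate.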