I am using the R package randomForest, where variable importance can be inspected with varImpPlot, which shows the mean decrease in Gini. I have studied random forests in detail and understand how the model works, but I cannot fully understand how the mean decrease in Gini is calculated, or rather why it depends on the size of the population.
Once the Gini index has been calculated, the mean decrease in Gini is aggregated with the following formula (divided by the number of trees):
I understand that there will be more splits in each tree with a larger population, but shouldn't these splits on average produce very small decreases in the Gini index?
Here is example code showing what I mean (as expected, the number of trees does not affect the mean decrease in Gini, but the population size has a huge effect, and the relationship seems to be more or less linear):
install.packages("randomForest")
library(randomForest)
set.seed(1)
a <- as.factor(c(rep(1, 20), rep(0, 30)))
b <- c(rnorm(20, 5, 2), rnorm(30, 4, 1))
c <- c(rnorm(25, 0, 1), rnorm(25, 1, 2))
data <- data.frame(a = a, b = b, c = c)
rf <- randomForest(data = data, a ~ b + c, importance = TRUE, ntree = 300)
varImpPlot(rf)
a2 <- as.factor(c(rep(1, 200), rep(0, 300)))
b2 <- c(rnorm(200, 5, 2), rnorm(300, 4, 1))
c2 <- c(rnorm(250, 0, 1), rnorm(250, 1, 2))
data2 <- data.frame(a2 = a2, b2 = b2, c2 = c2)
rf2 <- randomForest(data = data2, a2 ~ b2 + c2, importance = TRUE, ntree = 300)
varImpPlot(rf2)
a3 <- as.factor(c(rep(1, 2000), rep(0, 3000)))
b3 <- c(rnorm(2000, 5, 2), rnorm(3000, 4, 1))
c3 <- c(rnorm(2500, 0, 1), rnorm(2500, 1, 2))
data3 <- data.frame(a3 = a3, b3 = b3, c3 = c3)
rf3 <- randomForest(data = data3, a3 ~ b3 + c3, importance = TRUE, ntree = 300)
varImpPlot(rf3)
This results in the following plots, where the x-axis increases approximately tenfold with each tenfold increase in population:
My guess is that each split is weighted by the number of people in the node being split. That is, a split made near the root that divides 1000 people weighs more heavily than a split further down the tree that divides, say, 10 people. I can't find this in any literature, though, since all the calculations I have seen use fractions of the population rather than absolute numbers.
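To make the guess concrete, here is a minimal base-R sketch of the two weightings for a single split. It is only an illustration of the hypothesis, not randomForest's actual implementation: the helper names `gini` and `gini_decrease` and the class counts are made up for this example.

```r
# Gini impurity of a node, given its vector of class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Decrease in Gini for one split, computed under two weightings:
#   fraction: child impurities weighted by their share of the parent node
#   count:    the same decrease multiplied by the absolute size of the parent node
# (Hypothetical helper, written only to illustrate the guess above.)
gini_decrease <- function(left, right) {
  parent <- left + right
  n <- sum(parent)
  n_l <- sum(left)
  n_r <- sum(right)
  fraction <- gini(parent) - (n_l / n) * gini(left) - (n_r / n) * gini(right)
  c(fraction = fraction, count = n * fraction)
}

# Made-up class counts (class 0, class 1) in each child node;
# the second split has the same proportions with 10x the people.
small <- gini_decrease(left = c(25, 5), right = c(5, 15))
large <- gini_decrease(left = c(250, 50), right = c(50, 150))

print(rbind(small, large))
print(large / small)  # ratio of the large split to the small one, per weighting
```

If the importance were built from the fraction-weighted decrease, the two splits would contribute the same amount. Under the count-weighted version, the larger split contributes ten times as much, which would match the scaling seen in the plots.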
What is it that I am missing?