I am trying to write a function that will look at all the factor variables in a data frame and combine all the levels together but only if a level contains less observations then a certain percentage/threshold.
So far I have a function that works on a single variable but I am trying to apply that function to all the factors in the data frame. When I try to apply it to all the factors variables, I get an error
# Code to create data frame (df)
var <- factor(c(a <- rep("a", 100), b <- rep("b", 1000), c <- rep("c", 1000), d <- rep("d", 1000), e <- rep("e", 400), f <- rep("f", 100)))
var1 <- factor(c(a1 <- rep("a", 100), b1 <- rep("b", 400), c1 <- rep("c", 1000), d1 <- rep("d", 1000), e1 <- rep("e", 1000), f <- rep("f", 100)))
x_df <- data.frame(var = var, var1 = var1)
str(x_df)
# check the count of each level
sapply(x_df, function(x){
table(x)
})
# create the function
Merge.factors <- function(x, p) {
#Combines factor levels in x that are less than a specified proportion, p.
t <- table(x)
less <- subset(t, prop.table(t) < p)
more <- subset(t, prop.table(t) >= p)
other <- rep("Other", sum(less))
new.table <- c(more, table(other))
new.x <- as.factor(rep(names(new.table), new.table))
return(new.x)
}
# applying the function to a single factor variable - It works!
# This is the expected result
Merge.factors(x_df$var, 0.15)
Now I have tried two ways to apply this function to all the factor variables
# First method:
sapply(x_df, Merge.factors(0.15)) # Give an error, argument P is missing
# 2nd Method:
for (i in 1:ncol(x_df)) {
x_df[,i] <- Merge.factors(i, 0.15)
}
Any help will be greatly appreciated
RStudiotag unless you have a specific problem with the RStudio IDE - Sotos