Data manipulation is a breeze with packages such as plyr and dplyr. Recoding factor levels, which can be a daunting task for variables that have many levels, is easy with these packages. However, it is important for those learning data science to understand how base R works.
I am seeking help from R specialists about recoding factors in base R. My question is why one notation works while the other doesn’t.
I generate a vector with six categories and 300 observations. I convert the vector to a factor and generate the following tabulation.
x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)
> table(x)
a b c d e f
57 58 51 45 45 44
> table(as.numeric(x))
1 2 3 4 5 6
57 58 51 45 45 44
Note that by using the as.numeric function, I can see the internal integer code behind each character label. Let’s say I would like to recode categories a and f as missing. I can accomplish this with the following code.
x[as.numeric(x) %in% c(1,6)] <- NA
> table(factor(x))
b c d e
58 51 45 45
Here 1 and 6 correspond to a and f.
Note that I have used the position of the levels rather than the levels themselves to convert the values to missing.
So far so good.
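As a sanity check on the idea that the integer codes are positions in levels(x), here is a minimal sketch that recodes a and f to missing both by position and by label and compares the results. The seed, the explicit levels argument, and the names lv, by_pos and by_lab are assumptions added for illustration; they are not part of my original script.

```r
set.seed(1)  # assumption: fixed seed, only so the run is repeatable
lv <- c("a", "b", "c", "d", "e", "f")
# Explicit levels guarantee that code 1 is "a" and code 6 is "f"
x <- factor(sample(lv, 300, replace = TRUE), levels = lv)

# Each integer code is an index into levels(x)
identical(levels(x)[as.integer(x)], as.character(x))

# Recode by position (integer code) ...
by_pos <- x
by_pos[as.integer(by_pos) %in% c(1, 6)] <- NA

# ... and by label
by_lab <- x
by_lab[by_lab %in% c("a", "f")] <- NA

# Both routes should mark the same observations as missing
identical(by_pos, by_lab)
table(factor(by_pos))  # re-factoring drops the now-empty levels
```

Both approaches index the observations, one entry per element of x, which is presumably why either one works for setting values to NA.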
Now let’s assume that I would like to recode categories a and f as grades. I thought the following code would work, but it didn’t. It returns varying and erroneous results.
# Recode a and f as grades
x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)
table(as.numeric(x))
levels(x)[as.numeric(x) %in% c(1,6)] <- "grades"
table(factor(x))
a b c grades e f
46 46 56 52 42 58
However, when I refer to levels explicitly, the script works as intended. See the script below.
x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x); table(x)
my.list <- c("a", "f")
levels(x)[levels(x) %in% my.list] <- "grades"
table(factor(x))
grades b c d e
110 49 40 45 56
Hence the question: why does one method work while the other doesn’t?
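One observation that may bear on this, offered as a sketch rather than a definitive answer: the two index vectors have different lengths. The test on as.numeric(x) yields one entry per observation, while the test on levels(x) yields one entry per level, and levels(x) itself only has six elements. The seed, the explicit levels argument, and the names lv, idx_obs, idx_lev and y are assumptions introduced for illustration.

```r
set.seed(1)  # assumption: fixed seed, only so the run is repeatable
lv <- c("a", "b", "c", "d", "e", "f")
x <- factor(sample(lv, 300, replace = TRUE), levels = lv)

idx_obs <- as.numeric(x) %in% c(1, 6)  # one logical per observation
idx_lev <- levels(x) %in% c("a", "f")  # one logical per level

length(levels(x))  # 6
length(idx_obs)    # 300
length(idx_lev)    # 6

# Only the first six entries of idx_obs line up with existing level
# positions, and they describe the first six observations, not the levels
head(idx_obs, length(levels(x)))

which(idx_lev)     # 1 6: the positions of "a" and "f" within levels(x)

# Indexing the levels by their positions gives the intended merge
y <- x
levels(y)[c(1, 6)] <- "grades"
table(y)
```

If this reading is right, the positions 1 and 6 can still be used, but as a direct index into levels(x) rather than through a comparison on as.numeric(x).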