A question about recoding multiple factor levels simultaneously in R

Question

Data manipulation is a breeze with packages like plyr and dplyr. Recoding factor levels, which can be a daunting task for variables with many levels, is easy with these packages. However, it is important for those learning data science to understand how base R works.

I seek help from R specialists about recoding factors using base R. My question is about why one notation works while the other doesn’t.

I generate a vector with six categories and 300 observations. I convert the vector to a factor and generate the following tabulation.

x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)

> table(x)
a  b  c  d  e  f 
57 58 51 45 45 44 

> table(as.numeric(x))
 1  2  3  4  5  6 
57 58 51 45 45 44

Note that by using as.numeric(), I can see the internal integer code behind each character label. Let’s say I would like to recode categories a and f as missing. I can accomplish this with the following code.
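As a small illustration of how the integer codes relate to the labels, here is a minimal sketch on a hand-made factor (the five values are hypothetical and chosen only so the output is reproducible, unlike the sampled data above):

```r
# Hypothetical toy data; levels are sorted alphabetically by factor()
x_toy <- factor(c("b", "a", "f", "c", "a"))

levels(x_toy)                     # one entry per level
# [1] "a" "b" "c" "f"

as.integer(x_toy)                 # one code per observation
# [1] 2 1 4 3 1

levels(x_toy)[as.integer(x_toy)]  # codes index into levels, giving the labels back
# [1] "b" "a" "f" "c" "a"
```

The point to keep in mind is that levels(x_toy) has one element per level, while as.integer(x_toy) has one element per observation.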

x[as.numeric(x) %in% c(1,6)] <- NA
> table(factor(x))
b  c  d  e 
58 51 45 45

Here 1 and 6 correspond to a and f.

Note that I have used the position of the levels rather than the levels themselves to convert the values to missing.

So far so good.
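For comparison, the same recode to missing can presumably be written against the labels instead of the positions. This is only a sketch on a hypothetical nine-value factor, using droplevels() from base R to discard the emptied levels:

```r
# Hypothetical toy data standing in for the 300 sampled observations
x_na <- factor(c("a", "b", "c", "d", "e", "f", "a", "f", "c"))

# Select observations by label; %in% compares the factor's labels here
x_na[x_na %in% c("a", "f")] <- NA

# The levels "a" and "f" still exist but are unused; drop them
x_na <- droplevels(x_na)

levels(x_na)
# [1] "b" "c" "d" "e"
sum(is.na(x_na))
# [1] 4
```

Because the index has one element per observation in both versions, subsetting x itself (rather than levels(x)) behaves as expected.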

Now let’s assume that I would like to recode categories a and f as grades. I thought the following code would work, but it didn’t. It returns varying and erroneous answers.

# Recode a and f as grades
x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)
table(as.numeric(x))
levels(x)[as.numeric(x) %in% c(1,6)] <- "grades"
table(factor(x))
 a      b      c grades      e      f 
46     46     56     52     42     58

However, when I refer to levels explicitly, the script works as intended. See the script below.

x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x); table(x)
my.list <- c("a", "f")
levels(x)[levels(x) %in% my.list] <- "grades"
table(factor(x)) 
grades      b      c      d      e 
   110     49     40     45     56

Hence the question: why does one method work while the other doesn’t?

Ronak Shah · Accepted Answer · 2018-10-08T05:42:44

set.seed(123)
x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)
table(as.numeric(x))

# 1  2  3  4  5  6 
#44 55 56 49 48 48

Now, when you try to change the levels, note that

length(as.numeric(x) %in% c(1,6)) #gives
#[1] 300

whereas

length(levels(x)) #is just
#[1] 6

Next, when you do

as.numeric(x) %in% c(1,6) #it returns a vector of length 300
#[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE.......

So now, when you do

levels(x)[as.numeric(x) %in% c(1,6)]
#[1] "d" "e" "f" NA  NA  NA  NA  NA  NA  NA .....

with all the remaining values NA, because there are only six levels to select from.
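The same behaviour can be reproduced on a plain character vector. The sketch below uses hypothetical names (lev standing in for levels(x), idx for the per-observation logical vector) and small made-up values:

```r
lev <- c("a", "b", "c")                    # stands in for levels(x)
idx <- c(FALSE, TRUE, FALSE, TRUE, TRUE)   # longer than lev, like one flag per observation

# TRUE at positions 4 and 5 points past the end of lev, so those come back as NA
lev[idx]
# [1] "b" NA  NA

# Assigning through the same index overwrites position 2 and extends the vector
lev_new <- lev
lev_new[idx] <- "grades"
lev_new
# [1] "a"      "grades" "c"      "grades" "grades"
```

So which levels get renamed depends on where the TRUE values happen to fall among the first few observations, which would explain why the result changes from one random sample to the next.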

So,

levels(x)[as.numeric(x) %in% c(1,6)] <- "grades"

changes "d", "e" and "f" to "grades", and the identical labels are merged into a single level

table(x)
#x
# a      b      c grades 
#44     55     56    145

but that is not what you intended.
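If the goal is to recode by position, one option is to index levels(x) with the level positions themselves rather than with a per-observation vector. A minimal sketch on a hypothetical nine-value factor:

```r
# Hypothetical toy data with the same six levels
x_pos <- factor(c("a", "b", "c", "d", "e", "f", "a", "f", "c"))

# Positions 1 and 6 refer to levels "a" and "f"; the index now addresses levels, not observations
levels(x_pos)[c(1, 6)] <- "grades"

levels(x_pos)
# [1] "grades" "b"      "c"      "d"      "e"

table(x_pos)
# x_pos
# grades      b      c      d      e 
#      4      1      2      1      1
```

Giving two levels the same label merges them, so the four observations that were a or f end up under grades.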

In your second attempt

levels(x)[levels(x) %in% my.list]

it works because the logical index has one element per level, so its length matches

length(levels(x))
#[1] 6
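To make the length argument concrete, here is a sketch on a hypothetical nine-value factor. It also shows the named-list form of levels<-, described in ?levels, as an alternative way to merge levels by name:

```r
# Hypothetical toy data with the same six levels
x_lab <- factor(c("a", "b", "c", "d", "e", "f", "a", "f", "c"))
my.list <- c("a", "f")

# The index is built from levels(x_lab), so it has exactly one flag per level
idx_lab <- levels(x_lab) %in% my.list
idx_lab
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE
length(idx_lab) == nlevels(x_lab)
# [1] TRUE

# Alternative: a named list maps each new level to the old levels it absorbs
y_lab <- x_lab
levels(y_lab) <- list(grades = c("a", "f"), b = "b", c = "c", d = "d", e = "e")
levels(y_lab)
# [1] "grades" "b"      "c"      "d"      "e"

# The logical-index version from the question gives the same levels
levels(x_lab)[idx_lab] <- "grades"
identical(levels(x_lab), levels(y_lab))
# [1] TRUE
```

The list form is more verbose but spells out the full mapping, which may be easier to audit when many levels are involved.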
