1
votes

Data manipulation is a breeze with packages like plyr and dplyr. Recoding factor levels, which can be a daunting task for variables with many levels, is easily done with these packages. However, it is important for those learning data science to understand how base R works.

I seek help from R specialists about recoding factors in base R. My question is why one notation works while the other doesn’t.

I generate a vector with six categories and 300 observations. I convert the vector to a factor and generate the following tabulation.

x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)

> table(x)
a  b  c  d  e  f 
57 58 51 45 45 44 

> table(as.numeric(x))
 1  2  3  4  5  6 
57 58 51 45 45 44

Note that by using as.numeric(), I can see the internal integer code behind each character label. Let’s say I would like to recode categories a and f as missing. I can accomplish this with the following code.

x[as.numeric(x) %in% c(1,6)] <- NA
> table(factor(x))
b  c  d  e 
58 51 45 45 

Here 1 and 6 correspond to a and f.

Note that I have used the position of the levels rather than the levels themselves to convert the values to missing.

So far so good.
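For comparison, here is a minimal sketch of the same recoding done by label rather than by integer code. It uses a small fixed vector, `y`, that I made up for illustration so the result does not depend on the random draw.

```r
# Small fixed example (hypothetical data) instead of a random sample
y <- factor(c("a", "b", "f", "c", "a", "e", "d", "f"))

# Integer codes and the labels they point to
as.integer(y)             # 1 2 6 3 1 5 4 6
levels(y)[as.integer(y)]  # labels recovered from the codes

# Set a and f to missing by label, then drop the now-unused levels
y[y %in% c("a", "f")] <- NA
y <- droplevels(y)
levels(y)                 # "b" "c" "d" "e"
```

Both routes give the same result here, because the flags are applied to the observations of the factor and not to its levels.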

Now let’s assume that I would like to recode categories a and f as grades. I thought the following code would work, but it didn’t. It returns varying and erroneous answers.

# Recode a and f as grades
x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)
table(as.numeric(x))
levels(x)[as.numeric(x) %in% c(1,6)] <- "grades"
table(factor(x))
 a      b      c grades      e      f 
46     46     56     52     42     58

However, when I refer to levels explicitly, the script works as intended. See the script below.

x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x); table(x)
my.list = c("a", "f")
levels(x)[levels(x) %in% my.list] <- "grades"
table(factor(x)) 
grades      b      c      d      e 
   110     49     40     45     56

Hence my question: why does one method work while the other doesn’t?

2 Answers

0
votes
set.seed(123)
x <- sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE)
x <- factor(x)
table(as.numeric(x))

# 1  2  3  4  5  6 
#44 55 56 49 48 48 

Now, when you try to change the levels, note that

length(as.numeric(x) %in% c(1,6)) #gives
#[1] 300

whereas

length(levels(x)) #is just
#[1] 6

Next, when you do

as.numeric(x) %in% c(1,6) #it returns a vector of length 300
#[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE.......

So now, when you do

levels(x)[as.numeric(x) %in% c(1,6)]
#[1] "d" "e" "f" NA  NA  NA  NA  NA  NA  NA .....

with all the remaining elements being NA, because there are no more levels to select from beyond the sixth position.

So,

levels(x)[as.numeric(x) %in% c(1,6)] <- "grades"

changes "d", "e" and "f" to "grades"

table(x)
#x
# a      b      c grades 
#44     55     56    145 

but that is not what you intended.
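The same mechanism can be reproduced without any random numbers. This is only a sketch on a small fixed factor, `z`, whose values are made up for illustration:

```r
# Fixed data (made up for illustration) so no RNG is involved
z <- factor(c("b", "c", "d", "a", "f", "e", "a", "b"))

# One flag per observation (length 8), not one per level (length 6)
idx <- as.numeric(z) %in% c(1, 6)
which(idx)                    # 4 5 7

# Only flags 4 and 5 fall inside the six levels, so "d" and "e" are renamed
levels(z)[idx] <- "grades"
levels(z)                     # "a" "b" "c" "grades" "f"
table(z)
```

The observations that were "a" and "f" keep their labels; what gets renamed depends only on where the TRUE flags happen to fall among the first six positions.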

In your second attempt

levels(x)[levels(x) %in% my.list]

it works because the logical index levels(x) %in% my.list has the same length as levels(x), one flag per level:

length(levels(x))
#[1] 6
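
If you use this pattern often, it could be wrapped in a small helper. The function name `merge_levels` and its arguments are hypothetical, not part of base R. This is only a sketch that adds a check for misspelled labels:

```r
# Hypothetical helper: rename every level listed in `old` to the single label `new`
merge_levels <- function(f, old, new) {
  stopifnot(is.factor(f))
  unknown <- setdiff(old, levels(f))
  if (length(unknown) > 0) {
    stop("unknown level(s): ", paste(unknown, collapse = ", "))
  }
  # The index has one flag per level, so it lines up with levels(f)
  levels(f)[levels(f) %in% old] <- new
  f
}

# Fixed example data (made up for illustration)
w <- factor(c("a", "b", "f", "c", "a", "e", "d", "f"))
w2 <- merge_levels(w, c("a", "f"), "grades")
levels(w2)   # "grades" "b" "c" "d" "e"
table(w2)
```
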
0
votes

What do you want to achieve?

Manipulating factors through as.numeric() is not a good idea and can lead to surprises. My favorite way is to avoid factors whenever possible (e.g. by using stringsAsFactors=FALSE when creating data frames and as.is=TRUE with read.csv and read.table; as.is because the opposite is as.it.is.not). Manipulating character vectors is much more straightforward and less error-prone than any operation on factors. When a factor is technically needed, in many cases the analysis functions take care of it; if that's not enough, it is often easier to create a factor on the fly, with an appropriate ordering and labeling of levels, than to worry about all the confusion that comes with factors.
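
A minimal sketch of that workflow, assuming the data are kept as a plain character vector (the vector `grades_chr` and its values are made up for illustration):

```r
# Hypothetical data kept as a plain character vector
grades_chr <- c("a", "b", "f", "c", "a", "e", "d", "f")

# Recode with ordinary subsetting on the character vector
grades_chr[grades_chr %in% c("a", "f")] <- "grades"

# Build the factor only when it is needed, with an explicit level order
g <- factor(grades_chr, levels = c("grades", "b", "c", "d", "e"))
table(g)
```

Here the index and the vector being modified always have the same length, so there is nothing to get out of step.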

So what happens in

 levels(x)[as.numeric(x) %in% c(1,6)]

levels(x) is a character vector of length 6, while as.numeric(x) %in% c(1,6) is a logical vector of length 300. So you're trying to index a short vector with a much longer logical vector. In such indexing, the index vector acts like a "switch": TRUE means you want the item at that position in the output, and FALSE means you don't. So which elements of levels(x) are you asking for? (This depends on the random sample; you can make it reproducible with set.seed if that matters.)

> which(as.numeric(x) %in% c(1,6))
 [1]   4   9  10  12  14  16  24  35  37  44  47  52  54  57  58  61  63  69  79  81  82  83
[23]  84  86  87  89  91  92  99 100 103 109 114 121 124 125 129 134 135 138 140 141 143 147
[45] 154 167 178 179 181 187 188 194 201 212 213 214 217 218 219 220 222 232 235 237 239 245
[67] 254 255 258 260 263 265 266 267 275 278 281 286 294 295 296
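
The "switch" behaviour is easy to see on a plain character vector. This is a sketch with a hypothetical `flags` vector that is longer than the vector being indexed:

```r
lv <- c("a", "b", "c", "d", "e", "f")

# Index of the same length as lv: one switch per element
lv[c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)]   # "a" "f"

# Longer index (hypothetical flags): TRUE at positions 4, 9 and 10
flags <- rep(FALSE, 10)
flags[c(4, 9, 10)] <- TRUE
lv[flags]                                       # "d" NA NA
```

Positions 9 and 10 do not exist in `lv`, so they come back as NA. Only the TRUE at position 4 picks out a real element.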

If you want to replace some levels by referring to their numeric equivalent, you don't need as.numeric at all:

 levels(x)[c(1,6)] <- "grades"

 > levels(x)[c(1,6)] <- "grades"
 > table(x)
 x
 grades      b      c      d      e 
    101     45     46     62     46

"a" and "f" have been replaced by "grades" as you wanted. With as.numeric above, by contrast, you thought of levels 1 and 6 but actually asked for only level 4 to be changed, because 4 is the only TRUE position among the first six. (Which levels exactly is up to the RNG and not directly under your control.)