
On the R-devel list, Martin Maechler posted a message about duplicated levels in factors:

"factors with non-unique (duplicated) levels have been deprecated since 2009 -- are more deprecated now ..." June 4, 2016

It states that in R 3.4, scheduled for April 2017, duplicated levels will cause an error, no longer just a warning.

Why does the levels replacement function (levels(x) <- value) not produce a similar warning? Here I combine the first three levels into "a" in two ways, one of which is deprecated.

Example

> x <- c("a", "b", "c", "d")
> xf <- factor(x, levels = c("a", "b", "c", "d"), 
    labels = c("a", "a", "a", "d"))
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) 
    as.character(labels) else paste0(labels,  :
    duplicated levels in factors are deprecated
> xf <- factor(x)
> levels(xf) <- c("a", "a", "a", "d")
> xf
[1] a a a d
Levels: a d

I would like to understand why the latter is treated differently by R than the former.

This is the documented behavior of levels; I'm not exploiting an unstated element. In ?levels, there is an example in which duplicated levels are allowed. I'll paste it in to save you the lookup.

## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z

1 Answer


Factors are used to create categorical variables. The levels attribute of such a variable lists its distinct categories. A variable cannot have the same category twice; that would not make sense. However, many data values can belong to the same category.

The data inside a categorical variable is stored as an integer vector. Use unclass to see the integer vector. The levels attribute holds the category names, and each integer is an index into it. By default the levels are the sorted unique values of the data, so a value that belongs to the first level in sort order is coded as 1. For an ordered factor, the lowest category is coded as 1.

x <- c(letters[1:3], letters[1:3])
xf <- factor(x)

xf
# [1] a b c a b c
# Levels: a b c

attributes(xf)
# $levels
# [1] "a" "b" "c" 
# 
# $class
# [1] "factor"

unclass(xf)
# [1] 1 2 3 1 2 3
# attr(,"levels")
# [1] "a" "b" "c"
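To make the coding rule concrete, here is a small sketch. The vectors are made up for illustration and chosen so that the first data value is not the first level.

```r
# Hypothetical data: the first value "c" is the last level in sort order.
x2 <- c("c", "a", "b", "a")
xf2 <- factor(x2)

levels(xf2)      # the sorted unique values: "a" "b" "c"
as.integer(xf2)  # codes index into levels(xf2): 3 1 2 1

# For an ordered factor, the order given in `levels` defines the codes.
sizes <- factor(c("low", "high", "mid"),
                levels = c("low", "mid", "high"), ordered = TRUE)
as.integer(sizes)  # 1 3 2
```

The code stored for a value depends on the position of its level, not on where the value appears in the data.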

If a data value does not match any of the supplied levels, it becomes NA.

factor(c("a", "b", "c"), levels = c("e", "f", "g"))
# [1] <NA> <NA> <NA>
# Levels: e f g

labels is an optional argument that renames the categories. Each label replaces the level in the same position of the levels argument, so a data value that matches a level is shown under the corresponding label. Notice that the value "e" is given the category "h".

factor(c("a", "b", "e"), levels = c("e", "f", "g"), labels = c("h", "i", "j"))
# [1] <NA> <NA> h   
# Levels: h i j

Now levels(x) <- value is a replacement function that changes the category names of a factor variable, not the underlying integer codes. The new names are matched to the existing levels by position, so the replacement vector should have one entry per existing level. Otherwise garbage is created.

xf
# [1] a b c a b c
# Levels: a b c

The level "a" is renamed to "e", "b" to "f", and "c" to "g". This example shows how to properly rename the categories of a factor variable: supply one new name per existing level.

levels(xf) <- c("e", "f", "g")
xf
# [1] e f g e f g
# Levels: e f g
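As a quick check of the claim that renaming touches only the category names, the sketch below compares the integer codes before and after. It uses a separate, hypothetical variable yf so that xf is left as it is.

```r
# yf is an illustrative copy of the example data, not part of the original post.
yf <- factor(c("a", "b", "c", "a", "b", "c"))
codes_before <- as.integer(yf)

levels(yf) <- c("e", "f", "g")   # one new name per existing level
codes_after <- as.integer(yf)

identical(codes_before, codes_after)  # TRUE: only the names changed
levels(yf)                            # "e" "f" "g"
```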

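Now the garbage type: the replacement vector has six entries although xf has only three levels. The names are matched to the existing levels by position, so "e", "f", and "g" all become "m", and the surplus entries only add an unused level "n". To see the integer vector, use unclass(xf).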

levels(xf) <- c("m", "m", "m", "n", "n", "n")
xf
# [1] m m m m m m
# Levels: m n
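For comparison, here is a minimal sketch of merging levels deliberately, in the same way as the ?levels example quoted in the question. The variable names zf, merged_vec, and merged_list are made up for illustration. The list form of the levels replacement is documented in ?levels and spells out which old levels go into each new one.

```r
# Illustrative data; zf is not part of the original post.
zf <- factor(c("a", "b", "c", "a", "b", "c"))

# Vector form: one new name per existing level, matched by position.
merged_vec <- zf
levels(merged_vec) <- c("m", "m", "n")
levels(merged_vec)      # "m" "n"
as.integer(merged_vec)  # 1 1 2 1 1 2

# List form: new name = old levels to combine.
merged_list <- zf
levels(merged_list) <- list(m = c("a", "b"), n = "c")
levels(merged_list)      # "m" "n"
as.integer(merged_list)  # 1 1 2 1 1 2
```

In both forms the duplicated names are collapsed, so the resulting factor has unique levels.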