4
votes

Is it possible to compare two factors of the same length but with different levels? For example, if we have these two factor variables:

A <- factor(1:5)

str(A)
 Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5

B <- factor(c(1:3,6,6))

str(B)
 Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4

If I try to compare them using, for example, the == operator:

mean(A == B)

I get the following error:

Error in Ops.factor(A, B) : level sets of factors are different

2
Will you please explain what is meant by compare two factors. It is not clear to me. - user2100721
@user2100721 I am assuming they want to know the proportion of overlap. In example from my post, overlap is 3 out of 5, 3/5 = 0.6. Note that TRUE/FALSE is converted implicitly to 1/0, i.e.: TRUE + TRUE = 2. - zx8754
@zx8754 Thanks. Got your point. - user2100721
@zx8754 Sorry for the noise, I forgot to wrap with factor earlier. With microbenchmark, your solution is almost 2 times faster which is kind of surprising. - akrun
@zx8754 I am not at all concerned with rep :-) - akrun

2 Answers

11
votes

Convert to character then compare:

# data
A <- factor(1:5)
B <- factor(c(1:3,6,6))

str(A)
# Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
str(B)
# Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4

mean(A == B)

# Error in Ops.factor(A, B) : level sets of factors are different

mean(as.character(A) == as.character(B))
# [1] 0.6

Or another approach would be

mean(levels(A)[A] == levels(B)[B])

which is about 2 times slower on a dataset of 1e8 elements.
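
To check the speed difference on your own machine, here is a minimal sketch using base R's system.time. The vector size n, the seed, and the sampled values are assumptions chosen for illustration (1e6 rather than 1e8, to keep memory use modest), and timings will vary between machines, so none are shown.

```r
# Illustrative size only; the comparison above used 1e8 elements
n <- 1e6
set.seed(1)
A <- factor(sample(1:5, n, replace = TRUE))
B <- factor(sample(c(1:3, 6), n, replace = TRUE))

# Approach 1: convert both factors to character
t_char <- system.time(r1 <- mean(as.character(A) == as.character(B)))

# Approach 2: index each level vector by the factor's integer codes
t_lev <- system.time(r2 <- mean(levels(A)[A] == levels(B)[B]))

# Both approaches must give the same proportion of matches
stopifnot(isTRUE(all.equal(r1, r2)))

# Elapsed seconds for each approach
print(c(as_character = t_char[["elapsed"]], levels_index = t_lev[["elapsed"]]))
```

For more stable timings, repeat each call several times or use a dedicated benchmarking package such as microbenchmark.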

2
votes

Converting to character as in @zx8754's answer is the easiest solution to this problem, and probably the one you'd want to use almost always. Another option, though, is to correct the two variables so that they have the same levels. You might want to do this if you need to keep these variables as factors for some reason and don't want to clog up your code with repeated calls to as.character.

A <- factor(1:5)
B <- factor(c(1:3,6,6))

mean(A == B)
Error in Ops.factor(A, B) : level sets of factors are different

We can take the union of the levels of both factors to get every level that appears in either one, and then remake the factors using that union as the levels. Now, even though the two factors hold different values, they share the same level set and can be compared:

C <- factor(A, levels = union(levels(A), levels(B)))
D <- factor(B, levels = union(levels(A), levels(B)))

mean(C == D)
[1] 0.6

As you can see, the values are unchanged, but the levels are now identical.

C
[1] 1 2 3 4 5
Levels: 1 2 3 4 5 6

D
[1] 1 2 3 6 6
Levels: 1 2 3 4 5 6
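
If you need this in several places, the union-of-levels step could be wrapped in a small helper. This is only a sketch: the function name align_levels is made up for illustration and is not part of base R or any package.

```r
# Hypothetical helper (name invented here): rebuild two factors on a shared level set
align_levels <- function(x, y) {
  lv <- union(levels(x), levels(y))
  list(x = factor(x, levels = lv), y = factor(y, levels = lv))
}

A <- factor(1:5)
B <- factor(c(1:3, 6, 6))

aligned <- align_levels(A, B)

# The level sets now match, so == no longer raises an error
mean(aligned$x == aligned$y)
# [1] 0.6
```

Returning both factors in a list keeps the originals untouched, so you can decide whether to overwrite A and B or work with the aligned copies.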