2
votes

I have a large data set which used different coding schemes for the same variables over different time periods. The coding in each time period is represented as a column with values during the year it was active and NA everywhere else.

I was able to "combine" them by using nested ifelse commands together with dplyr's mutate [see edit below], but I am running into a problem using ifelse to do something slightly different. I want to code a new variable based on whether ANY of the previous variables meets a condition. But for some reason, the ifelse construct below does not work.

MWE:

library("dplyr")
library("magrittr")
df <- data.frame(id = 1:12, year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)), varA = c("A","C","A","C","B",rep(NA,7)), varB = c(rep(NA,5),"B","A","C","A","B",rep(NA,2)))
df %>% mutate(varC = ifelse(varA == "C" | varB == "C", "C", "D"))

Output:

> df
   id year varA varB varC
1   1 1995    A <NA> <NA>
2   2 1995    C <NA>    C
3   3 1995    A <NA> <NA>
4   4 1995    C <NA>    C
5   5 1995    B <NA> <NA>
6   6 1996 <NA>    B <NA>
7   7 1996 <NA>    A <NA>
8   8 1996 <NA>    C    C
9   9 1996 <NA>    A <NA>
10 10 1996 <NA>    B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

If I don't use the | operator, and test against only varA, it will come out with the results as expected, but it will only apply to those years that varA is not NA.

Output:

> df %<>% mutate(varC = ifelse(varA == "C", "C", "D"))
> df
   id year varA varB varC
1   1 1995    A <NA>    D
2   2 1995    C <NA>    C
3   3 1995    A <NA>    D
4   4 1995    C <NA>    C
5   5 1995    B <NA>    D
6   6 1996 <NA>    B <NA>
7   7 1996 <NA>    A <NA>
8   8 1996 <NA>    C <NA>
9   9 1996 <NA>    A <NA>
10 10 1996 <NA>    B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

Desired output:

> df
   id year varA varB varC
1   1 1995    A <NA>    D
2   2 1995    C <NA>    C
3   3 1995    A <NA>    D
4   4 1995    C <NA>    C
5   5 1995    B <NA>    D
6   6 1996 <NA>    B    D
7   7 1996 <NA>    A    D
8   8 1996 <NA>    C    C
9   9 1996 <NA>    A    D
10 10 1996 <NA>    B    D
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

How do I get what I'm looking for?

To make this question more applicable to a wider audience, and to learn from this situation, it would be great have an explanation as to what is happening with the comparison using | that causes it not to work as expected. Thanks in advance!

EDIT: This is what I meant by successfully combining them with nested ifelses

> df %>% mutate(varC = ifelse(year == 1995, as.character(varA), 
+                             ifelse(year == 1996, as.character(varB), NA)))
   id year varA varB varC
1   1 1995    A <NA>    A
2   2 1995    C <NA>    C
3   3 1995    A <NA>    A
4   4 1995    C <NA>    C
5   5 1995    B <NA>    B
6   6 1996 <NA>    B    B
7   7 1996 <NA>    A    A
8   8 1996 <NA>    C    C
9   9 1996 <NA>    A    A
10 10 1996 <NA>    B    B
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>
2
Look at the output of df$varA == "C" | df$varB == "C"JasonAizkalns
Huh. I shoulda thought of checking that. I couldn't do it with my original data set (1 mil+ observations) but with the MWE it was doable. Thanks! Question still stands about how to fix it, though.Ron Levitin
watch that NA=="C" returns NA, while NA %in% "C" is FALSEKhashaa

2 Answers

2
votes

R has this annoying tendency where the logical value of a condition that involves NA is just NA, rather than true or false. i.e. NA>0 = NA rather than FALSE

NA interacts with TRUE just like false does. i.e. TRUE|NA = TRUE. TRUE&NA = NA.

Interestingly, it also interacts with FALSE as if it was TRUE. i.e. FALSE|NA=NA. FALSE&NA=FALSE

In fact, NA is like a logical value between TRUE and FALSE. e.g. NA|TRUE|FALSE = TRUE.

So here's a way to hack this:

ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB))

How do we interpret this? On the left side of the OR, we have the following: If varA is NA, then we have NA&FALSE. Since NA is one step above FALSE in the hierarchy of logicals, the & is going to force the whole thing to be FALSE. Otherwise, if varA is not NA but it's not 'C', you'll have FALSE&TRUE which gives FALSE as you want. Otherwise, if it's 'C', they're both true. Same goes for the thing on the right of the OR.

When using a condition that involves x, but x can be NA, I like to use ((condition for x)&!is.na(x)) to completely rule out the NA output and force the TRUE or FALSE values in the situations I want.

EDIT: I just remembered that you want an NA output if they're both NA. This doesn't end up doing it, so that's my bad. Unless you're okay with a 'D' output when they're both NA.

EDIT2: This should output the NAs as you want:

ifelse(is.na(varA)&is.na(varB), NA, ifelse((varA=='C'&!is.na(varA))|(varB=='C'&!is.na(varB)), 'C','D'))
1
votes

Per @Khashaa comment. This should do the trick and get you to the desired output.

df %>%
  mutate(varC = ifelse(is.na(varA) & is.na(varB), NA, 
                       ifelse(varA %in% "C" | varB %in% "C", "C", "D")))