1
votes

I am trying to recode a race/ethnicity variable derived from a Hispanic variable and six other race variables.

I've tried some of these methods. I can't quite figure out how to keep all my different levels for each factor.

The Hispanic variable has 5 levels: 0:NA, 1:yes, 2:no, 3:unable to determine, 99:missing.

Each race variable has 3 levels: 0:does not apply, 1:applies, 99:missing.

My new p1raceeth variable will have 6 levels: 0:Unknown, 1:Black, NH, 2:Hispanic any race, 3:Other, 4:White, NH, 99:missing.

I also tried the code below. It works for one race variable (amakn), but when I move on to the next race variable with the same code, it overwrites the previous recode. Any suggestions would be very helpful, including suggestions on how to collapse the factor levels within the race and Hispanic variables to make this more manageable.

What prompted this was I was trying to combine asian with nhopi. Quite the rabbit hole.

d_a$p1raceeth <- "0"
d_a$p1raceeth[d_a$Hispanic=="0" & d_a$amakn=="0"] <- "0"
d_a$p1raceeth[d_a$Hispanic=="0" & d_a$amakn=="1"] <- "0"
d_a$p1raceeth[d_a$Hispanic=="0" & d_a$amakn=="99"] <- "0"

d_a$p1raceeth[d_a$Hispanic=="1" & d_a$amakn=="0"] <- "2"
d_a$p1raceeth[d_a$Hispanic=="1" & d_a$amakn=="1"] <- "2"
d_a$p1raceeth[d_a$Hispanic=="1" & d_a$amakn=="99"] <- "99"

d_a$p1raceeth[d_a$Hispanic=="2" & d_a$amakn=="0"] <- "99"
d_a$p1raceeth[d_a$Hispanic=="2" & d_a$amakn=="1"] <- "3"
d_a$p1raceeth[d_a$Hispanic=="2" & d_a$amakn=="99"] <- "99"

d_a$p1raceeth[d_a$Hispanic=="3" & d_a$amakn=="0"] <- "0"
d_a$p1raceeth[d_a$Hispanic=="3" & d_a$amakn=="1"] <- "3"
d_a$p1raceeth[d_a$Hispanic=="3" & d_a$amakn=="99"] <- "99"

d_a$p1raceeth[d_a$Hispanic=="99" & d_a$amakn=="0"] <- "99"
d_a$p1raceeth[d_a$Hispanic=="99" & d_a$amakn=="1"] <- "3"
d_a$p1raceeth[d_a$Hispanic=="99" & d_a$amakn=="99"] <- "99"

Here is a sample of my data:

df <- read.table(text=
"Hispanic amakn asian blkaa nhopi white utod
1           1          0          0          0          0          1         0
2           2         99         99          1         99         99        99
3          99         99         99         99         99         99        99
4           3         99         99         99         99         99        99
5           0         99         99         99         99         99        99
6          99         99         99         99         99         99        99
7           3         99         99         99         99         99        99
8           0         99         99         99         99         99        99
9           2          0          0          0          0          1         0
10          2          0          0          0          0          1         0
11          2          0          0          0          0          1         0
12          1          0          0          0          0          1         0
13          0         99         99         99         99         99        99
14          2          0          0          0          0          1         0
15          0         99         99         99         99         99        99
16          2          0          0          0          0          1         0
17          2          0          0          1          0          0         0
18          0          0          0          0          0          0         0
19         99         99         99         99         99         99        99
20          1         99         99         99         99         99        99
21          0         99         99         99         99         99        99
22          2          0          0          0          0          1         0
23          2          0          0          0          0          1         0
24          2          0          0          1          0          0         0
25          0         99         99         99         99         99        99
26         99          0          0          0          0          1         0
27          0         99         99         99         99         99        99
28         99          0          0          0          0          1         0
29          1         99         99         99         99         99        99
30         99         99         99         99         99         99        99
31          2          0          0          0          0          1         0
32          2          0          0          0          0          1         0
33          3          0          1          0          0          0         0
34          2         99         99         99         99          1        99
35          2          0          0          0          0          1         0
36          1         99         99         99         99         99        99
37          0         99         99         99         99         99        99
38          2          0          0          0          0          1         0
39         99         99         99         99         99         99        99
40          1         99         99         99         99         99        99
", header=TRUE)

2 Answers

2
votes

I'd recommend coding missing values as NA, which makes life easier.

d_a[] <- lapply(d_a, function(x) {x[x %in% 99] <- NA;x})
d_a$Hispanic[d_a$Hispanic %in% 0] <- NA
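To illustrate why this helps, here is a minimal sketch on a made-up three-row data frame (`toy` and its values are hypothetical, not the asker's data). With 99 left in place the row sums mix real answers with missing codes; after the recode they simply count the race categories that apply.

```r
# Hypothetical toy data; 99 is the missing-value code, as in the question.
toy <- data.frame(Hispanic = c(2, 1, 99),
                  asian    = c(1, 0, 99),
                  nhopi    = c(1, 99, 99))

# With 99 left in, the sums are not interpretable as counts.
raw_sums <- rowSums(toy[-1])

# Same recode as above: assigning into toy[] keeps the data frame structure.
toy[] <- lapply(toy, function(x) { x[x %in% 99] <- NA; x })

# Now the sums count how many race categories apply in each row.
na_sums <- rowSums(toy[-1], na.rm = TRUE)

print(raw_sums)
print(na_sums)
```

The second print gives 2 for the first row (both asian and nhopi apply) and 0 for the others, which is the kind of count used further down to detect rows with more than one race.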

Then, using within, we go through the options one by one.

  1. Create an index variable mis that identifies rows where all race variables are NA.
  2. Create an empty p1raceeth variable that is NA everywhere.
  3. Where Hispanic is NA and the rowSums of the other variables are zero, we set "unknown".
  4. Where Hispanic is 1 and the row is not in mis, we have "Hispanic any race". Where Hispanic is 2 or 3 and the row is not in mis, we set "Other" accordingly.
  5. Set "White" where Hispanic is 2 and white is 1, and "Black" accordingly.
  6. There may be rows where more than one race variable has the value 1. We might want to set p1raceeth to NA (or something else) for those; they are identifiable as rows where the rowSums without "Hispanic" are greater than 1.
  7. (If we want to, we can set all NA to "missing", but I don't recommend this, since it would delete the information that the value is NA, so I've commented it out.)
  8. Finally, we remove the mis variable so that it does not appear in the result.

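res <- within(d_a, {
  # 1. TRUE for rows where every race variable is NA
  mis <- apply(d_a[-1], 1, function(x) all(is.na(x)))
  # 2. start from NA
  p1raceeth <- NA
  # 3. Hispanic missing and no race applies
  p1raceeth[is.na(Hispanic) & rowSums(d_a[-1]) %in% 0] <- "unknown"
  # 4. Hispanic status known and at least one race variable observed
  p1raceeth[Hispanic %in% 1 & !mis] <- "Hispanic any race"
  p1raceeth[Hispanic %in% 2:3 & !mis] <- "Other"
  # 5. non-Hispanic White / Black
  p1raceeth[Hispanic %in% 2 & white %in% 1] <- "White"
  p1raceeth[Hispanic %in% 2 & blkaa %in% 1] <- "Black"
  # 6. more than one race applies
  p1raceeth[rowSums(d_a[-1], na.rm = TRUE) > 1] <- NA
  # 7. p1raceeth[is.na(p1raceeth)] <- "missing"
  # 8. drop the helper variable
  rm(mis)
})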

Note that I used %in% here instead of the (probably more familiar) ==. That's important because == yields NA whenever the value being compared is NA, which we don't want here, whereas %in% always returns TRUE or FALSE.
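A small sketch of the difference, using a hypothetical vector `hisp` that contains one NA:

```r
# Hypothetical values: 1 = yes, 2 = no, NA = missing
hisp <- c(1, 2, NA)

eq  <- hisp == 1     # comparison with NA propagates the NA
inn <- hisp %in% 1   # %in% never returns NA

sel_eq <- hisp[eq]   # the NA in the index produces an NA element
sel_in <- hisp[inn]  # only the matching element is selected

print(eq)      # TRUE FALSE NA
print(inn)     # TRUE FALSE FALSE
print(sel_eq)  # 1 NA
print(sel_in)  # 1
```

With ==, every combined condition such as `Hispanic == 2 & white == 1` would need an extra `!is.na(...)` guard to behave the same way.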

If you need a 'factor' variable, you can optionally convert it as a last step:

res$p1raceeth <- as.factor(res$p1raceeth)
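as.factor() sorts the levels alphabetically (the exact order depends on the locale). If a particular level order matters, one option is factor() with an explicit levels argument. The sketch below uses a hypothetical vector `lab` holding the labels from the code above, and an order that mirrors the numbering in the question; both are assumptions for illustration.

```r
# Hypothetical label vector with the same labels as p1raceeth
lab <- c("White", "Black", NA, "unknown", "Other", "Hispanic any race")

# Default: levels in alphabetical (locale-dependent) order
f_default <- as.factor(lab)

# Explicit order, following the question's coding 0..4 (an assumption)
lvl <- c("unknown", "Black", "Hispanic any race", "Other", "White")
f_ordered <- factor(lab, levels = lvl)

print(levels(f_default))
print(levels(f_ordered))
print(table(f_ordered, useNA = "ifany"))  # NA stays NA, not a level
```

Applied to the result, this would be `res$p1raceeth <- factor(res$p1raceeth, levels = lvl)`.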

Result

I show the unique rows of the result, ordered by Hispanic.

unique(res[order(res$Hispanic), ])
#    Hispanic amakn asian blkaa nhopi white utod         p1raceeth
# 1         1     0     0     0     0     1    0 Hispanic any race
# 20        1    NA    NA    NA    NA    NA   NA              <NA>
# 2         2    NA    NA     1    NA    NA   NA             Black
# 9         2     0     0     0     0     1    0             White
# 17        2     0     0     1     0     0    0             Black
# 34        2    NA    NA    NA    NA     1   NA             White
# 4         3    NA    NA    NA    NA    NA   NA              <NA>
# 33        3     0     1     0     0     0    0             Other
# 3        NA    NA    NA    NA    NA    NA   NA              <NA>
# 18       NA     0     0     0     0     0    0           unknown
# 26       NA     0     0     0     0     1    0              <NA>
-1
votes

Perhaps you can use ifelse/case_when and combine the conditions using %in%:

library(dplyr)

df %>%
  mutate(p1raceeth = case_when(Hispanic == 0 & amakn %in% c(0, 1, 99) ~ 0,
                               Hispanic == 1 & amakn %in% c(0, 1) ~ 2,
                               Hispanic %in% c(2, 3) & amakn == 1 ~ 3,
                               Hispanic == 3 & amakn == 0 ~ 0,
                               Hispanic == 99 & amakn == 1 ~ 3,
                               TRUE ~ 99))