1
votes

A dataset I am using recorded respondents' ethnicity. Responses are recorded across multiple variables, and respondents were allowed to pick more than one. Example:

Black White Asian Hispanic 

1     NA    NA    NA

NA    1     NA    NA

NA    NA    NA    1

NA    NA    1     1

^^^In the last row, the respondent would have chosen Asian and Hispanic.

What I want to do is:

A) collapse these columns into a single ethnicity variable, with different numbers representing different ethnicities (i.e., black would be 1, white would be 2 etc.)

B) have it so that anyone who reported multiple columns gets designated "multiple".

I'm a bit of an R novice, so any help would be greatly appreciated!

2

2 Answers

1
votes

One way to do this is to pivot your column names as a column, group values by respondent, then drop the NA values. Then just choose the ethnicity value that remains for each group, switching to "multiple" when necessary. Here's a way to do that with tidyverse:

library(tidyverse)

df %>% 
  rownames_to_column("respondent") %>% 
  pivot_longer(-respondent) %>% 
  group_by(respondent) %>% 
  filter(!is.na(value)) %>% 
  summarise(eth = ifelse(n() == 1, name, "multiple"))

# A tibble: 4 x 2
  respondent eth     
  <chr>      <chr>   
1 1          Black   
2 2          White   
3 3          Hispanic
4 4          multiple

You won't be able to store numbers, as numeric types, with a string like "variable" - so you have a choice. Either stick with the ethnicity labels (like the solution above), or convert labels to numbers and then numbers to the string representations of those numbers. That seems a little unwieldy, but if you want to do that, here's how:

df %>% 
  rownames_to_column("respondent") %>% 
  pivot_longer(-respondent) %>% 
  mutate(eth_num = as.character(as.numeric(fct_inorder(name)))) %>% 
  group_by(respondent) %>% 
  filter(!is.na(value)) %>% 
  summarise(eth = ifelse(n() == 1, eth_num, "multiple"))

# A tibble: 4 x 2
  respondent eth     
  <chr>      <chr>   
1 1          1       
2 2          2       
3 3          4       
4 4          multiple
0
votes

Here's a base R option -

#Get the name of column with non-NA value
ethinicty <- names(df)[max.col(!is.na(df))]
#If there are more than 1 option selected change it to 'multiple'
ethinicty[rowSums(!is.na(df)) > 1] <- 'multiple'
ethinicty

#[1] "Black"    "White"    "Hispanic" "multiple"