Filter for two-part response data from another dataframe and join two dataframes

Question

I have a survey question in the format: "Do you prefer a rose or a tulip? Imagine that the rose has colors V1 and V2, and the tulip has colors V3 and V4"

The actual colors are drawn from combinations contained in one dataframe:

Dataframe 1 (df1):

structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("red", "ruby"), class = "factor"), 
V2 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L), .Label = c("blue", "violet"), class = "factor"), 
V3 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 
2L, 2L, 1L, 1L, 2L, 2L), .Label = c("green", "turqoise"), class = "factor"), 
V4 = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L), .Label = c("black", "yellow"), class = "factor")), .Names = c("V1", 
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -16L
))

In this dataframe (df1) the first two columns (V1 and V2) correspond to "rose", and the last two columns (V3 and V4) correspond to "tulip". For example, a respondent could be shown combination 1 from the first row of df1, which is "red blue green yellow". This means that the respondent could choose a "rose that is red and blue" or a "tulip that is green and yellow".

The choices made by respondents are contained in a separate dataframe (df2). df2 has one column per every single combination of colors. If respondent 1 was shown the first combination from df1 ("red blue green yellow") and selected a tulip (that is green and yellow), the choice is marked with "2" (for tulip, i.e. second flower) in the first row of df2. If respondent 2 was shown the second combination from df1 ("red blue green black") and selected a rose (that is red and blue), the choice is marked with "1" (for rose, i.e. first flower) in the second row of df2. In other words, "2" means "tulip chosen, rose not chosen", and 1" means "rose chosen, tulip not chosen".

Dataframe 2 (df2):

structure(list(respondentID = 1:16, v1 = c(2L, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v2 = c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v3 = c(NA, 
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
    v4 = c(NA, NA, NA, 2L, NA, NA, NA, NA, NA, NA, 1L, 2L, NA, 
    NA, NA, NA), v5 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA), v6 = c(NA, 2L, NA, NA, NA, NA, NA,
    NA, NA, 1L, NA, NA, NA, NA, NA, NA), v7 = c(NA, NA, NA, NA, 
    1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v8 = c(NA, 
    NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
    ), v9 = c(NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA, NA, 
    NA, NA, NA, NA), v10 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), v11 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA), v12 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA
    ), v13 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, 1L, NA, NA), v14 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), v15 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), v16 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L
    )), .Names = c("respondentID", "v1", "v2", "v3", "v4", "v5",
"v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", 
"v16"), class = "data.frame", row.names = c(NA, -16L))

If I wanted only to know which flower was chosen and the colors, I could do it using:

df1_with_id <- df1 %>% 
  setNames(paste0("color", 1:4)) %>%
  mutate(combo = paste0("v", row_number()))

result_df <- df2 %>%
  gather(key = combo, value = val, -respondentID) %>%
  filter(!is.na(val)) %>%
  left_join(df1_with_id, by = "combo") %>%
  arrange(respondentID)

(As per this question)

But this doesn't give me the format I need. I need information on both options (i.e. "rose that is V1 and V2" and "tulip that is V3 and V4") shown to each respondent in separate rows and an additional variable that indicates choice between the two options, like this: Desired result

(In the image, "1" in the choice variable refers to option chosen by respondent, and "0" is the option not chosen.)

I can't quite figure out how to write code to organize data in this way. Any advice?

I realize I didn't explain the problem well at all. I've fixed it now. — KaC
Flower is simply rose or tulip. If a cell in the flower column is rose, then color1 and color2 contain information on the colors of the rose. The colors for tulip are in the row underneath. No respondent makes two choices; each respondent is presented with two options (rose with two colors and tulip with two colors) and chooses one of these options. choice refers to the choice made. If choice = 1, the flower in that row was selected; if choice = 0, the flower in that row was not selected by the respondent. — KaC
The extra value in respondentID = 2 was a typo. Sorry. Fixed it now. respondentID= 9 only has one response (unless I'm missing something). But regardless: every respondent can only make one choice. The other issue is precisely the trick: in df1 we have V1, V2, V3, and V4 all in one row. in result_df I need V1 and V2 in one row (next to rose) and V3 and V4 in the row underneath (next to tulip)--along with a column that indicates which of the flowers was chosen. — KaC

CJ Yetman CJ Yetman · Accepted Answer · 2018-05-10T17:35:58

The main problem here is that each of the columns in your df1 signify two bits of information: the flower type and the color number. So rename them to include both bits of info, gather them into one column, separate the key column into flower and color columns, and then spread the color column. Then you just need to convert the val into a 1 if it matches the flower column, or 0 otherwise.

df2 %>%
  gather(key = combo, value = val, -respondentID) %>%
  filter(!is.na(val)) %>%
  left_join(df1_with_id, by = "combo") %>%
  arrange(respondentID) %>% 
  rename(rose_color1 = color1, rose_color2 = color2,
         tulip_color1 = color3, tulip_color2 = color4) %>% 
  gather(color, value, rose_color1:tulip_color2) %>% 
  separate(color, into = c('flower', 'color')) %>% 
  spread(color, value) %>% 
  mutate(val = if_else(val == 1, 'rose', 'tulip')) %>% 
  mutate(val = if_else(val == flower, 1, 0)) %>% 
  select(respondentID, flower, color1, color2, choice = val)

Filter for two-part response data from another dataframe and join two dataframes

1 Answers