How should I handle multiple choice/response (check-all-that-apply) data?

Question

I am working with some survey data and have run into a minor issue. Some questions ask respondents to pick their top 3 from a list of candidates.

For example,

Here is a list of fruits from which a respondent can choose 3: 1) Banana 2) Apple 3) Grapefruit 4) Peach 5) Watermelon
Different respondents then gave different answers to this question:
- respondent a - 1, 3, 4 (banana, grapefruit, peach)
- respondent b - 1, 2, 5
- respondent c - 3, 4
- (and so on)
Whoever was in charge of cleaning the survey data stored the answers in three columns, one for each of the three choices a respondent made.

     Q1_1  Q2_2  Q3_3
    a   1     3     4
    b   1     2     5
    c   3     4     NA

My question is: is there any way to put this into a single column? I know I can dummy-code the answers into 5 columns, one for each of the fruits listed above:

        Banana Apple Grapefruit Peach Watermelon
      a    1     0        1       1        0
      b    1     1        0       0        1
      c    0     0        1       1        0
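For reference, the indicator table above can be derived from the three-column layout with a few lines of base R. This is only a minimal sketch: the object names `wide`, `fruits`, `picks`, and `dummies` are made up for illustration, and the data is the three-respondent toy example from this question.

```r
# Toy data copied from the question; rows are respondents a, b, c.
wide <- data.frame(
  Q1_1 = c(1, 1, 3),
  Q2_2 = c(3, 2, 4),
  Q3_3 = c(4, 5, NA),
  row.names = c("a", "b", "c")
)
fruits <- c("Banana", "Apple", "Grapefruit", "Peach", "Watermelon")

# For each fruit code 1..5, flag respondents who picked it in any column.
# NA (fewer than three choices) is ignored via na.rm = TRUE.
picks <- as.matrix(wide)
dummies <- sapply(seq_along(fruits), function(code) {
  as.integer(rowSums(picks == code, na.rm = TRUE) > 0)
})
dimnames(dummies) <- list(rownames(wide), fruits)
dummies
```

The result is an integer matrix with one row per respondent and one 0/1 column per fruit.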

However, I am afraid this might hurt the accuracy of the predictive model that I plan to fit later in my research. One of the questions gave respondents around 990 options to choose from, so if I stick to dummy coding, the dimensionality of the data will increase significantly.
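To make the size concern concrete: one possible way to hold a wide 0/1 coding without storing all the zeros is a sparse matrix. The sketch below assumes the Matrix package (a recommended package that ships with standard R installations) and reuses the toy data from above; `n_options` is a hypothetical name that would be about 990 for the large question. Note that this only changes how the indicators are stored. It does not reduce the number of predictors.

```r
library(Matrix)  # assumption: Matrix is available (recommended package)

wide <- data.frame(
  Q1_1 = c(1, 1, 3),
  Q2_2 = c(3, 2, 4),
  Q3_3 = c(4, 5, NA),
  row.names = c("a", "b", "c")
)
n_options <- 5  # hypothetical; roughly 990 for the large question

picks <- as.matrix(wide)
keep <- !is.na(picks)        # drop unused choice slots
i <- row(picks)[keep]        # respondent (row) index of each pick
j <- picks[keep]             # option code doubles as the column index

# One stored entry per (respondent, option) pick. If a respondent listed
# the same option twice, sparseMatrix() would sum the entries, so
# de-duplicate first in that case.
ind <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                    dims = c(nrow(picks), n_options),
                    dimnames = list(rownames(wide), NULL))
ind
```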

Please let me know if there is a good way to handle this! I would also love to know whether there is an R package designed for this kind of data.

Please make this question reproducible. This includes sample code (including listing non-base R packages), sample unambiguous data (e.g., dput(head(x)) or data.frame(x=...,y=...)), and expected output. Refs: stackoverflow.com/questions/5963269, stackoverflow.com/help/mcve, and stackoverflow.com/tags/r/info — kstew

kstew · Accepted Answer · 2019-07-17T04:34:33

I would suggest using dplyr together with gather() from tidyr to transform your three fruit variables into a single long variable. Note that in my toy example sample() can give a respondent the same fruit more than once, so I keep only one row per respondent and fruit.

library(dplyr)
library(tidyr)  # gather() comes from tidyr

df <- data.frame(id=1:100,
                 fruit1=sample(c('banana','apple','grape','peach','watermelon'),100,TRUE),
                 fruit2=sample(c('banana','apple','grape','peach','watermelon'),100,TRUE),
                 fruit3=sample(c('banana','apple','grape','peach','watermelon'),100,TRUE),
                 outcome=runif(100))

# count how often each respondent picked each fruit (n > 1 means a duplicate, eg, putting apple twice)
dupl <- df %>% gather(k,v,-id,-outcome) %>%
  count(id,v)

# only keep one of the duplicated rows; ungroup() so that select() can drop n
df1 <- df %>% gather(k,v,-id,-outcome) %>% left_join(dupl, by=c('id','v')) %>%
  group_by(id,v,n) %>% slice(1) %>% ungroup() %>% select(-n)

# regress the outcome on the long fruit variable (apple is the reference level)
lm(outcome~v,df1)

Call:
lm(formula = outcome ~ v, data = df1)

Coefficients:
(Intercept)      vbanana       vgrape       vpeach  vwatermelon  
   0.482981    -0.023715     0.020129     0.008117    -0.053460
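As a side note, gather() has been superseded in tidyr by pivot_longer(), and the de-duplication step can be written with dplyr's distinct(). The following is a hedged sketch of the same idea rather than a drop-in replacement for the code above: the `fruits` and `long` names and the seed are assumptions added for illustration, and the coefficients will differ from the output shown because the data is random.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: seed added only so the toy data is repeatable
fruits <- c("banana", "apple", "grape", "peach", "watermelon")
df <- data.frame(id = 1:100,
                 fruit1 = sample(fruits, 100, TRUE),
                 fruit2 = sample(fruits, 100, TRUE),
                 fruit3 = sample(fruits, 100, TRUE),
                 outcome = runif(100))

# One row per (respondent, choice slot), then one row per (respondent, fruit).
long <- df %>%
  pivot_longer(c(fruit1, fruit2, fruit3), names_to = "k", values_to = "v") %>%
  distinct(id, v, .keep_all = TRUE)

fit <- lm(outcome ~ v, data = long)
coef(fit)
```

Keep in mind that in this long layout each respondent contributes up to three rows, so the rows are not independent observations.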
