0
votes

I am working with some survey data and ran into a minor issue. Some questions ask respondents to check their top 3 out of a list of candidates...

For example,

  • here is a list of fruits from which you can choose 3: 1) Banana 2) Apple 3) Grapefruit 4) Peach 5) Watermelon

  • and then, multiple respondents gave different answers to this question.

    • respondent a - 1, 3, 4 (banana, grapefruit, peach)
    • respondent b - 1, 2, 5
    • respondent c - 3, 4
    • (and so on)
  • and whoever was in charge of cleaning the survey data put the answers into three columns, where each column represents one of the three choices the respondent made.

     Q1_1  Q2_2  Q3_3
    a   1     3     4
    b   1     2     5
    c   3     4     NA
  • My question is... is there any way I can turn these into a single column? I know I can dummify them and make 5 columns corresponding to the fruits listed above...
        Banana Apple Grapefruit Peach Watermelon
      a    1     0        1       1        0
      b    1     1        0       0        1
      c    0     0        1       1        0
  • However, I am afraid this might hurt the accuracy of the predictive model that I expect to run later in my research. One of the questions gave respondents around 990 options to choose from. If I stick with dummification, the dimension of the data will increase significantly...

Please let me know if there is a good way to do this! I would also love to know if there is an R package designed for this kind of problem.

1
Please make this question reproducible. This includes sample code (including listing non-base R packages), sample unambiguous data (e.g., dput(head(x)) or data.frame(x=...,y=...)), and expected output. Refs: stackoverflow.com/questions/5963269, stackoverflow.com/help/mcve, and stackoverflow.com/tags/r/info - kstew

1 Answer

1
votes

I would suggest using dplyr together with gather() from tidyr to transform your three fruit variables into a single long variable. Note that in my toy example each respondent may have duplicated fruit responses, because sample() draws with replacement, so I remove the duplicated rows.

library(dplyr)
library(tidyr)  # gather() comes from tidyr, not dplyr

fruits <- c('banana','apple','grape','peach','watermelon')
df <- data.frame(id=1:100,
                 fruit1=sample(fruits,100,TRUE),
                 fruit2=sample(fruits,100,TRUE),
                 fruit3=sample(fruits,100,TRUE),
                 outcome=runif(100))

# count how often each respondent picked each fruit (n > 1 means a duplicate, eg, putting apple twice)
dupl <- df %>% gather(k,v,-id,-outcome) %>% 
  count(id,v)

# only keep one of the duplicated rows
# (ungroup() before select(), otherwise a grouping column cannot be dropped)
df1 <- df %>% gather(k,v,-id,-outcome) %>% left_join(dupl, by=c('id','v')) %>% 
  group_by(id,v) %>% slice(1) %>% ungroup() %>% select(-n)

# regress the outcome on the chosen fruit (one row per respondent-fruit pair)
lm(outcome~v,df1)

Call:
lm(formula = outcome ~ v, data = df1)

Coefficients:
(Intercept)      vbanana       vgrape       vpeach  vwatermelon  
   0.482981    -0.023715     0.020129     0.008117    -0.053460
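
For reference, here is a minimal sketch of the same wide-to-long reshape applied to the three-respondent example from the question. It uses pivot_longer(), which supersedes gather() in current tidyr. The object and column names (survey, fruits, slot, code, long, indicators) are my own illustrative choices, and I am assuming the NA in respondent c's third column simply means a missing third pick.

```r
library(dplyr)
library(tidyr)

# The example from the question; NA marks a missing third choice (assumption)
survey <- data.frame(
  id   = c("a", "b", "c"),
  Q1_1 = c(1, 1, 3),
  Q2_2 = c(3, 2, 4),
  Q3_3 = c(4, 5, NA)
)

# Code-to-label lookup taken from the question's fruit list
fruits <- c("Banana", "Apple", "Grapefruit", "Peach", "Watermelon")

# Wide -> long: one row per (respondent, chosen fruit), dropping empty slots
long <- survey %>%
  pivot_longer(cols = starts_with("Q"),
               names_to = "slot", values_to = "code",
               values_drop_na = TRUE) %>%
  mutate(fruit = factor(code, levels = 1:5, labels = fruits)) %>%
  distinct(id, fruit)   # guards against a fruit listed twice by one respondent

# If the 0/1 layout is needed later, it can be derived from the long table
indicators <- table(long$id, long$fruit)
```

The long table stays at one row per choice no matter how many options a question offers, so a question with roughly 990 options does not need 990 columns until (and unless) a particular model requires the indicator layout.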