Multiple-choice in R: how to tidy survey data using dplyr/tidyr?

Question

I have been using gather() from the tidyr R package to tidy my survey data.

Is there a good way to deal with multiple-choice questions when tidying data?

This question is not about a specific error, but more about what strategy is most fitting.

Imagine the following tibble:

tb1 <- tribble(~id,~x1,~x2,~x3,~y1,~y2,~z,
               "Harry",1,1,NA,NA,1,"No",
               "Jess",NA,1,1,1,1,"Yes",
               "George",NA,NA,1,NA,1,"No")

When gathering these multiple-choice results, I (logically) get multiple rows for 'Harry', 'Jess' and 'George':

tb1 %>%
  gather(X,val,x1:x3,-id,-z) %>%
  filter(!is.na(val)) %>%
  select(-val) %>%
  gather(Y,val,y1:y2,-id,-X,-z) %>%
  filter(!is.na(val)) %>%
  select(-val) 

# A tibble: 7 x 4
  id     z     X     Y    
  <chr>  <chr> <chr> <chr>
1 Jess   Yes   x2    y1   
2 Jess   Yes   x3    y1   
3 Harry  No    x1    y2   
4 Harry  No    x2    y2   
5 Jess   Yes   x2    y2   
6 Jess   Yes   x3    y2   
7 George No    x3    y2

I'm a bit worried about the multiple entries, and was wondering whether there's a good strategy to deal with multiple choice questions of a survey with binary columns that need to be gathered.

In the end, I'd like to be able to plot and analyse the values of various variables, e.g. the number of times that people selected y2.

This long format does not seem practical for that, because count() is inflated by the repeated rows, e.g. Harry's two rows for y2.
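
To make the concern concrete, here is a minimal sketch that compares counting rows with counting distinct respondents. The names long_xy, rows_per_y and people_per_y are only illustrative, and the na.rm argument of gather() stands in for the filter() steps used above:

```r
library(tibble)
library(dplyr)
library(tidyr)

tb1 <- tribble(~id, ~x1, ~x2, ~x3, ~y1, ~y2, ~z,
               "Harry",  1,  1, NA, NA, 1, "No",
               "Jess",   NA, 1,  1,  1, 1, "Yes",
               "George", NA, NA, 1, NA, 1, "No")

# Same double gather as above, dropping the NA answers directly
long_xy <- tb1 %>%
  gather(X, val, x1:x3, na.rm = TRUE) %>%
  select(-val) %>%
  gather(Y, val, y1:y2, na.rm = TRUE) %>%
  select(-val)

# Counting rows: every y answer is repeated once per x answer of that person,
# so y2 is counted 5 times
rows_per_y <- count(long_xy, Y)

# Counting respondents: each person contributes at most once per y option,
# so y2 is counted 3 times
people_per_y <- long_xy %>%
  distinct(id, Y) %>%
  count(Y)
```

Deduplicating with distinct(id, Y) before count() is one way to keep this long format and still count people rather than rows.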

The flow of questions I have regarding this topic is as follows:

1. Would it be better/easier for analysis to gather multiple responses into a single column?
2. If yes, how do you do this efficiently?
3. If no, what implications do I have to watch out for in further analysis when I keep the multiple responses in long format, and how do you incorporate those implications into your code? (Maybe a specific "group" argument for id? Could you show me an example?)

Humpelstielzchen · Accepted Answer · 2019-05-08T13:02:59

I think the easiest way is definitely to gather all the responses into one column.

library(tidyverse)

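# Turn z into indicator columns (z_No, z_Yes), then stack every answer
# column into a single Question column, dropping the NA answers
reshape_tb1 <- tb1 %>%
  spread(key = z, value = z, sep = "_") %>%
  gather(key = "Question", value = "Answer", -id, na.rm = TRUE) %>%
  select(-Answer)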

> reshape_tb1
# A tibble: 12 x 2
   id     Question
   <chr>  <chr>   
 1 Harry  x1      
 2 Harry  x2      
 3 Jess   x2      
 4 George x3      
 5 Jess   x3      
 6 Jess   y1      
 7 George y2      
 8 Harry  y2      
 9 Jess   y2      
10 George z_No    
11 Harry  z_No    
12 Jess   z_Yes

This way you can easily feed it to ggplot2:

ggplot(reshape_tb1) +
  geom_bar(aes(x = Question))
