I have been using gather() from the tidyr R package to tidy my survey data.
Is there a good way to deal with multiple-choice questions when tidying data?
This question is not about a specific error, but about which strategy fits best.
Imagine the following tibble:
library(dplyr)
library(tidyr)

tb1 <- tribble(~id, ~x1, ~x2, ~x3, ~y1, ~y2, ~z,
               "Harry",  1,  1,  NA, NA, 1, "No",
               "Jess",   NA, 1,  1,  1,  1, "Yes",
               "George", NA, NA, 1,  NA, 1, "No")
When I gather these multiple-response columns, I (logically) get multiple rows for 'Harry', 'Jess' and 'George':
tb1 %>%
  gather(X, val, x1:x3, -id, -z) %>%
  filter(!is.na(val)) %>%
  select(-val) %>%
  gather(Y, val, y1:y2, -id, -X, -z) %>%
  filter(!is.na(val)) %>%
  select(-val)
# A tibble: 7 x 4
  id     z     X     Y
  <chr>  <chr> <chr> <chr>
1 Jess   Yes   x2    y1
2 Jess   Yes   x3    y1
3 Harry  No    x1    y2
4 Harry  No    x2    y2
5 Jess   Yes   x2    y2
6 Jess   Yes   x3    y2
7 George No    x3    y2
I'm a bit worried about the repeated entries, and I'm wondering whether there is a good strategy for multiple-choice survey questions that are stored as binary columns and need to be gathered.
In the end, I'd like to be able to plot and analyse the values of various variables, e.g. the number of times that people selected y2.
This long format does not seem practical for that, because count() counts every row: Harry selected y2 once, but he has two y2 rows (one for each X value he selected), so both are counted.
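To illustrate the double counting, here is a minimal sketch that rebuilds the long table from above and compares a plain count() with a count of distinct respondents. The names long, naive and per_person are just ones I made up for this example, and the distinct() step is only my guess at a workaround, not necessarily the right strategy.

```r
library(dplyr)
library(tidyr)

tb1 <- tribble(~id, ~x1, ~x2, ~x3, ~y1, ~y2, ~z,
               "Harry",  1,  1,  NA, NA, 1, "No",
               "Jess",   NA, 1,  1,  1,  1, "Yes",
               "George", NA, NA, 1,  NA, 1, "No")

# Same long format as above: one row per selected X/Y combination
long <- tb1 %>%
  gather(X, val, x1:x3) %>%
  filter(!is.na(val)) %>%
  select(-val) %>%
  gather(Y, val, y1:y2) %>%
  filter(!is.na(val)) %>%
  select(-val)

# Plain count: counts rows, so respondents with several X values are repeated
naive <- count(long, Y)

# Possible workaround (assumption): keep one row per respondent and Y first
per_person <- long %>%
  distinct(id, Y) %>%
  count(Y)

print(naive)
print(per_person)
```

With this data the plain count gives 5 for y2, although only three people selected it; the distinct() version gives 3.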
My questions, in order:
- Would it be better/easier for analysis to gather multiple responses into a single column?
- If yes, how do you do this efficiently?
- If no, what are the implications that I have to watch out for in further analysis when I keep the multi-responses in long format?
- And how do you incorporate those implications into your code? (Maybe a specific "group" argument for id? Could you show me an example?)
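To make the first option concrete, this is roughly what I mean by gathering multiple responses into a single column: one row per respondent, with all selected choices of a question pasted into one string. It is only a rough sketch; the column names item, question and choices are made up for this example, it assumes the first letter of each column name identifies the question, and I don't know whether this is an efficient or idiomatic way to do it.

```r
library(dplyr)
library(tidyr)

tb1 <- tribble(~id, ~x1, ~x2, ~x3, ~y1, ~y2, ~z,
               "Harry",  1,  1,  NA, NA, 1, "No",
               "Jess",   NA, 1,  1,  1,  1, "Yes",
               "George", NA, NA, 1,  NA, 1, "No")

collapsed <- tb1 %>%
  # Stack all binary choice columns into item/val pairs
  gather(item, val, x1:x3, y1:y2) %>%
  filter(!is.na(val)) %>%
  # Assumption: the first letter of the column name is the question ("x" or "y")
  mutate(question = substr(item, 1, 1)) %>%
  group_by(id, z, question) %>%
  # Paste all selected choices of one question into a single string
  summarise(choices = paste(item, collapse = ","), .groups = "drop") %>%
  # Back to one row per respondent, one column per question
  spread(question, choices)

print(collapsed)
```

This keeps one row per id, so nothing is repeated, but the choices then sit in comma-separated strings that would have to be split again before counting.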