I am working on some survey data and ran into a minor issue. Some questions ask respondents to pick their top 3 from a list of candidates.
For example, here is a list of fruits from which respondents choose 3:
1) Banana 2) Apple 3) Grapefruit 4) Peach 5) Watermelon
Different respondents gave different answers to this question:
- respondent a - 1, 3, 4 (banana, grapefruit, peach)
- respondent b - 1, 2, 5
- respondent c - 3, 4
- (and so on)
Whoever was in charge of cleaning the survey data put this into three columns, each representing one of the (up to) three choices the respondent made:
     Q1_1  Q2_2  Q3_3
a    1     3     4
b    1     2     5
c    3     4     NA
- My question is: is there any way to turn this into a single column? I know I can dummify the choices and make 5 columns, one for each of the fruits above:
     Banana  Apple  Grapefruit  Peach  Watermelon
a    1       0      1           1      0
b    1       1      0           0      1
c    0       0      1           1      0
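For concreteness, here is a minimal base R sketch of the dummification I mean. The names `survey`, `fruits`, and `dummies` are placeholders I made up for this toy example, and I am assuming the cleaned data stores the numeric option codes exactly as shown above:

```r
# Toy data mirroring the three-column layout above (names are illustrative).
survey <- data.frame(
  Q1_1 = c(1, 1, 3),
  Q2_2 = c(3, 2, 4),
  Q3_3 = c(4, 5, NA),
  row.names = c("a", "b", "c")
)

# Option labels, in the order of their numeric codes 1..5.
fruits <- c("Banana", "Apple", "Grapefruit", "Peach", "Watermelon")

# For each option code, flag respondents who picked it in any of the columns.
# NA (fewer than three choices given) is simply ignored.
dummies <- sapply(seq_along(fruits), function(code) {
  as.integer(rowSums(survey == code, na.rm = TRUE) > 0)
})
dimnames(dummies) <- list(rownames(survey), fruits)

dummies
```

This gives one 0/1 column per option, so with 5 fruits I get 5 columns, and the same approach on the larger question would give one column per option.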
- However, I am afraid this might hurt the accuracy of the predictive model I plan to run later in my research. One of the questions gave respondents around 990 options to choose from. If I stick with dummification, the dimensionality of the data will increase significantly (roughly 990 columns for that question alone, instead of one column per recorded choice).
Please let me know if there is a good way to handle this! I would also love to know whether there is an R package designed for this kind of problem.