One can't say what is best without knowing the purpose, but storing the answers as indicator columns, i.e. one 0/1 column per option, would let you perform a regression or tabulate them easily. Here we convert x into a 0/1 matrix m and then compute what fraction of respondents answered yes to each question. We also regress with the indicators in various ways, of which two are shown, and compute some correlations and a heatmap.
We also show a plot based on applying stack to the list representation, so it might be useful to keep more than one representation and convert among them.
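x <- c("1,2,3", "1,4,5")
# 0/1 indicator matrix: one row per respondent, one column per option
m <- t(+outer(1:5, lapply(strsplit(x, ","), as.numeric), Vectorize(`%in%`)))
colMeans(m) # fraction of respondents selecting each option
y <- 1:2 # toy response variable
lm(y ~ m+0) # regress y on the indicators, no intercept
lapply(1:5, function(i) glm(m[, i] ~ y, family = binomial())) # one logistic regression per option
cor(m) # correlations between options
cor(t(m)) # correlations between respondents
heatmap(m)
stk <- stack(setNames(lapply(strsplit(x, ","), as.numeric), seq_along(x))) # long form: values, ind
plot(stk)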
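Since converting among the representations comes up repeatedly, here is a minimal base R sketch of a round trip from string to list to 0/1 matrix and back. It assumes five options coded 1 to 5, as in the example; the names lst, lst2 and x2 are only illustrative.

```r
# Assumption: five options coded 1..5, as in the example above.
x <- c("1,2,3", "1,4,5")

# string -> list of numeric vectors
lst <- lapply(strsplit(x, ","), as.numeric)

# list -> 0/1 matrix (one row per respondent, one column per option)
m <- t(sapply(lst, function(v) +(1:5 %in% v)))

# 0/1 matrix -> list of selected options
# (lapply over rows rather than apply, which would simplify to a matrix here)
lst2 <- lapply(seq_len(nrow(m)), function(i) which(m[i, ] == 1))

# list -> comma-separated string
x2 <- sapply(lst2, paste, collapse = ",")

identical(x2, x) # checks that the round trip recovers the input
```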
Here is a data frame with 4 different possibilities:
library(dst) # encode/decode
DF <- data.frame(x, stringsAsFactors = FALSE)
DF$list <- strsplit(x, ",")
DF <- cbind(DF, m, code = apply(m, 1, decode, base = 2))
DF
##       x    list 1 2 3 4 5 code
## 1 1,2,3 1, 2, 3 1 1 1 0 0   28
## 2 1,4,5 1, 4, 5 1 0 0 1 1   19
Note that decode converts each row of 0/1 values into a single number by reading the row as base 2 digits, and encode can be used to reverse that:
t(encode(base = rep(2, 5), c(28, 19)))
##   [,1] [,2] [,3] [,4] [,5]
## r    1    1    1    0    0
##      1    0    0    1    1
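If you would rather not depend on a package for this, the same base 2 coding can be sketched in base R. This is only an illustration of the arithmetic, not the dst implementation; it assumes the leftmost column is the most significant bit, which is what reproduces the codes 28 and 19 above. The names pow, code and back are made up for the example.

```r
# Assumption: columns run from most to least significant bit,
# which reproduces the codes 28 and 19 shown above.
m <- rbind(c(1, 1, 1, 0, 0),
           c(1, 0, 0, 1, 1))
pow <- 2^((ncol(m) - 1):0) # place values 16 8 4 2 1

# 0/1 rows -> one number per row (the role decode plays above)
code <- drop(m %*% pow)

# numbers -> 0/1 rows (the role encode plays above)
back <- outer(code, pow, function(k, p) (k %/% p) %% 2)

all(back == m) # checks that the coding is reversible
```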
List-columns can work in frames (try mtcars$new <- Map(c, mtcars$gear, mtcars$carb)), but some frame-friendly tools don't always react well to them (though there are always workarounds). A different approach might be to store the different values in a "long" format instead of storing multiple values in a single "cell". This takes restructuring of the rest of the data, so is not simple enough for a comment (and needs a more-reproducible problem). - r2evans
Q1_a, Q1_b, etc., even better if you replace a and b with the actual name of what they selected. - Marius
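For reference, the "long" format mentioned in the comments could look roughly like this in base R: one row per respondent and selected option. This is a sketch on the toy x from the answer; the column names id and option and the object names long and m_long are hypothetical, and real data would need the other variables merged in by id.

```r
x <- c("1,2,3", "1,4,5")
lst <- lapply(strsplit(x, ","), as.numeric)

# One row per (respondent, selected option) pair.
long <- data.frame(
  id     = rep(seq_along(lst), lengths(lst)),
  option = unlist(lst)
)

# Counting rows per option gives the number of respondents selecting it.
table(factor(long$option, levels = 1:5))

# Cross-tabulating id by option rebuilds the 0/1 indicator matrix.
m_long <- unclass(table(long$id, factor(long$option, levels = 1:5)))
```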