Can a column be a vector or list class?

Question

I'm working with multiple-response questions in a survey, and I have a character column containing values such as "1,2,3" and "1,4,5". Participants tick every option that applies, and this is the result I'm given.

What is the best solution to deal with this problem? Should I create new columns that tell me if a value in that list is present or not? Or can I create a column that has a list/vector class?

What is your desired output? Yes, in general you should avoid working with CSV data if you can. — Tim Biegeleisen
Technically, yes, list-columns can work in frames (try mtcars$new <- Map(c, mtcars$gear, mtcars$carb)), but some frame-friendly tools don't always react well to them (though there are always workarounds). A different approach might be to store the different values in a "long" format instead of storing multiple values in a single "cell". This takes restructuring of the rest of the data, so is not simple enough for a comment (and needs a more-reproducible problem). — r2evans
Convert those values into separate rows. stackoverflow.com/questions/15347282/… — Ronak Shah
Really it depends on your goals - what do you want to do with the data?? But yes, in most cases I agree with Ronak, separate rows are easiest. — Gregor Thomas
If you only have 5 options, then it sometimes makes sense to create 5 columns that are just 0/1 to signify whether or not that response was ticked, rather than going to separate rows. So you could have columns like Q1_a, Q1_b, etc., even better if you replace a and b with the actual name of what they selected. — Marius

G. Grothendieck · Accepted Answer · 2019-09-11T02:13:55

One can't say what is best without knowing the purpose, but storing the responses as indicator columns, i.e. one 0/1 column per option, would let you run a regression or tabulate them easily. Here we convert x into a 0/1 matrix m, compute the fraction of respondents who selected each option, fit regressions in two of the many possible ways, and compute some correlations and a heatmap.

We also show a plot based on applying stack to the list representation, so it may be useful to keep more than one representation and convert among them.

x <- c("1,2,3", "1,4,5")
m <- t(+outer(1:5, lapply(strsplit(x, ","), as.numeric), Vectorize(`%in%`)))

colMeans(m)

y <- 1:2
lm(y ~ m+0)
lapply(1:5, function(i) glm(m[, i] ~ y, family = binomial()))

cor(m)
cor(t(m))

heatmap(m)

stk <- stack(setNames(lapply(strsplit(x, ","), as.numeric), seq_along(x)))
plot(stk)
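
For readers who find the outer/Vectorize one-liner dense, the same 0/1 matrix can be built more explicitly. This is only a sketch and not part of the original answer: the names opts, resp and m2 and the "opt_" column prefix are made up for illustration, and it assumes the full set of options is the integers 1 to 5.

```r
x <- c("1,2,3", "1,4,5")
opts <- 1:5                                   # assumed full set of options

# Split each response string into a numeric vector of ticked options
resp <- lapply(strsplit(x, ","), as.numeric)

# sapply gives one column per respondent, so transpose to get
# one row per respondent and one 0/1 column per option
m2 <- t(sapply(resp, function(r) as.integer(opts %in% r)))
colnames(m2) <- paste0("opt_", opts)          # hypothetical column names

m2
colMeans(m2)   # fraction of respondents who ticked each option
```

Naming the columns this way is in the spirit of the Q1_a, Q1_b suggestion in the comments; replacing the numbers with the actual option labels would make later tables easier to read.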

Here is a data frame with four different representations: the original string, a list column, the indicator columns, and a single numeric code:

library(dst) # encode/decode

DF <- data.frame(x, stringsAsFactors = FALSE)
DF$list <- strsplit(x, ",")
DF <- cbind(DF, m, code = apply(m, 1, decode, base = 2))
DF
##       x     list  1 2 3 4 5  code
## 1 1,2,3  1, 2, 3  1 1 1 0 0    28
## 2 1,4,5  1, 4, 5  1 0 0 1 1    19

Note that decode converts each row of 0/1 values into a single number by reading the row as binary digits, and encode can be used to reverse that:

t(encode(base = rep(2, 5), c(28, 19)))
##   [,1] [,2] [,3] [,4] [,5]
## r    1    1    1    0    0
##      1    0    0    1    1
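
If installing dst is not an option, the same binary coding can probably be reproduced in base R. The following is a hedged sketch rather than the dst implementation: weights, code and bits are illustrative names, and it assumes the leftmost column is the most significant bit, which matches the codes 28 and 19 shown above.

```r
# Indicator matrix from the answer: one row per respondent
m <- rbind(c(1, 1, 1, 0, 0),
           c(1, 0, 0, 1, 1))

# Bits to number: weight each column by a power of two,
# leftmost column being the most significant bit
weights <- 2^((ncol(m) - 1):0)
code <- as.vector(m %*% weights)

# Number back to bits: integer-divide by each weight, then take mod 2
bits <- t(sapply(code, function(k) (k %/% weights) %% 2))

code
bits
```

A single numeric code is compact, but it is only meaningful together with the option order, so the indicator columns are usually easier to analyse directly.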
