4
votes

I have a 'check all that apply' item from a survey I would like to process. The data come from a string variable in which each choice a respondent makes is encoded into the same variable. Respondents may choose from a list of 21 options, all that apply to them. I would like to create a set of 21 dummy variables indicating yes/no whether or not a respondent selected a particular option.

Three example responses are:

id  x 
1   3, 13
2   1, 3, 8, 9, 11, 13
3   1, 9
...

And what I would like is:

id  x                   x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13   
1   3, 13                0  0  1  0  0  0  0  0  0   0   0   0   1 
2   1, 3, 8, 9, 11, 13   1  0  1  0  0  0  0  1  1   0   1   0   1
3   1, 9                 1  0  0  0  0  0  0  0  1   0   0   0   0
...

In my attempt to do this, I've read an id variable and the response variable into a list jp such that each respondent has an id in jp[[1]] and his/her response in jp[[2]]:

> jp[[2]][1:3]
[1] "3, 13                                                                     "
[2] "1, 3, 8, 9, 11, 13                                                        "
[3] "1, 9                                                                      "

I then cleaned them up via strsplit on the commas and put that in jp[[4]]:

> jp[[4]][1:3]
[[1]]
[1] "3"  "13"

[[2]]
[1] "1"  "3"  "8"  "9"  "11" "13"

[[3]]
[1] "1" "9"
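
The cleaning step itself is not shown above, so here is a minimal sketch of what it might look like. The vector `raw` is a hypothetical stand-in for jp[[2]], and using `trimws` to strip the padding is an assumption, not necessarily what was actually run:

```r
# Hypothetical stand-in for jp[[2]]: padded, comma-separated responses
raw <- c("3, 13          ",
         "1, 3, 8, 9, 11, 13   ",
         "1, 9           ")

# Split each response on commas, then strip the whitespace around every piece
cleaned <- lapply(strsplit(raw, ",", fixed = TRUE), trimws)

cleaned[1:3]
```

Trimming matters here because a later comparison such as x == "13" would not match a piece that still carries a leading space.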

I found the unique values across all list elements:

> taught <- as.character(sort(as.numeric(unique(unlist(jp[[4]])))))
> taught
 [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "256"

Through a little trial and error, I figured out I could process each respondent's choices as follows:

sapply(jp[[4]], function(x) any(x == "1"))

And this appears to work ok:

> table(sapply(jp[[4]], function(x) any(x == "1")))

FALSE  TRUE 
 9404  1891 

This is the prevalence I expect.

However, because each respondent can have from 0 to 21 responses (sublist elements), I figured I needed to loop through each unique response in each respondent's sublist, writing out the results to a new list element.

I'm hoping to take the list element jp[[4]], where the cleaned-up responses are, and loop through each element of 'taught' to see if it exists in each respondent's sublist.

bla <- function(dt, lst) {
    for (i in 1:length(lst)) {
        subs <- list()
        # apply function on each part, by row
        subs[[i]] <- sapply(dt, function(x) any(x == taught[i]))
    }
    return(subs)
}

bla(jp[[4]], taught)

Unfortunately, it only appears to work for the last (the 21st, or '256') element in 'taught'; the results for the other elements are not saved to the list 'subs' that I defined in the function.

> table(bla(jp[[4]], taught)[21])

FALSE  TRUE 
10645   650 

> table(sapply(jp[[4]], function(x) any(x == "256")))

FALSE  TRUE 
10645   650 

Suggestions welcome. Thanks.

2 Answers

5
votes

Using , as the separator in your dataset makes it awkward to work with. If you replace it with some other character, such as -, the data become easier to handle. Assuming you can do that, this should work.

tally<-function(df)
{
  # Build a one-row data.frame with 23 columns: id, the original x, and x1..x21
  response_table=data.frame(matrix(nrow=1,ncol=23))
  names(response_table)=c("id","x",paste("x",1:21,sep=""))
  response_table$id=df$id
  response_table$x=df$x
  response_table[,3:23]=0
  # Split x on the separator (change '-' to whatever you use) and set the
  # matching response columns to 1; the +2 skips the id and x columns.
  # This assumes every option code lies between 1 and 21.
  response_table[,as.numeric(unlist(str_split(df$x,'-')))+2]=1
  return(response_table)
}



library(stringr)  # for str_split
library(plyr)     # for ddply
test_data=data.frame(id=1:3,x=c("3-13","1-3-8-9-11-13","1-9"),
                     stringsAsFactors=FALSE)

> test_data
  id             x
1  1          3-13
2  2 1-3-8-9-11-13
3  3           1-9
responses=ddply(test_data, .(id), tally)



> responses
  id             x x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21
1  1          3-13  0  0  1  0  0  0  0  0  0   0   0   0   1   0   0   0   0   0   0   0   0
2  2 1-3-8-9-11-13  1  0  1  0  0  0  0  1  1   0   1   0   1   0   0   0   0   0   0   0   0
3  3           1-9  1  0  0  0  0  0  0  0  1   0   0   0   0   0   0   0   0   0   0   0   0
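
If the raw responses still use ", " as in the question, one way to get them into the dash-separated form assumed above is a single gsub call. This is only a sketch; the vector `x` below is made-up example data in the question's format:

```r
# Example responses in the question's original comma-separated format
x <- c("3, 13", "1, 3, 8, 9, 11, 13", "1, 9   ")

# Drop surrounding padding, then replace each comma (and any spaces
# around it) with a single dash
x_dash <- gsub("\\s*,\\s*", "-", trimws(x))

x_dash
```

The result can then be stored as the x column before calling ddply with tally.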
2
votes

EXAMPLE DATA

test_data=data.frame(id=1:3,x=c("3,13","1,3,8,9,11,13","1,9"), 
                     stringsAsFactors=FALSE)

SOLUTION

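library(plyr)

# For each id, start from a zero vector (vc) and set the chosen positions to 1
test_data_resp <- ddply(test_data, .(id), function(data, vc) {
  v1 <- as.numeric(strsplit(data$x, split=",")[[1]])  # chosen option numbers
  vc[v1] <- 1
  return(vc)
}, vc = numeric(23))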
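
For comparison, here is a base R sketch of the same idea that needs no packages and looks codes up by value rather than by position, so a code such as "256" would simply become one more column. The names `choices`, `codes`, `dummies` and `result` are made up for illustration. It also shows what the loop in the question was aiming for: that loop re-creates `subs` on every iteration (so only the last element survives) and reads the global `taught` instead of its `lst` argument.

```r
test_data <- data.frame(id = 1:3,
                        x = c("3, 13", "1, 3, 8, 9, 11, 13", "1, 9"),
                        stringsAsFactors = FALSE)

# Split each response on commas and trim the pieces
choices <- lapply(strsplit(test_data$x, ",", fixed = TRUE), trimws)

# Codes to look for; for the question's data this might be c(1:20, 256)
codes <- as.character(1:13)

# One column per code: TRUE where that respondent's choices include the code
dummies <- sapply(codes, function(code) {
  vapply(choices, function(ch) code %in% ch, logical(1))
})

# Convert TRUE/FALSE to 1/0 and name the columns x1, x2, ...
dummies <- dummies * 1L
colnames(dummies) <- paste0("x", codes)

result <- cbind(test_data, dummies)
result
```

Because the lookup is done once per code over the whole list, no per-respondent loop or preallocated result list is needed.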