I have a data.table with more than 200 variables, all of them binary. I want to add a column that counts, for each row, how many values differ from a reference vector:
# Example
library(data.table)
dt = data.table(
  "V1" = c(1,1,0,1,0,0,0,1,0,1,0,1,1,0,1,0),
  "V2" = c(0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0),
  "V3" = c(0,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0),
  "V4" = c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0),
  "V5" = c(1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0)
)
# one reference value per column, in column order
reference = c(1,1,0,1,0)
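I can do that with a small for loop, such as:
# preallocate the result instead of growing it from NULL
distance = numeric(nrow(dt))
for (i in seq_len(nrow(dt))) {
  # number of positions where row i differs from the reference
  distance[i] = sum(reference != dt[i, ])
}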
But this is slow and surely not the best way to do it. I also tried:
dt[,"distance":= sum(reference != c(V1,V2,V3,V4,V5))]
dt[,"distance":= sum(reference != .SD)]
Neither works: both return the same value for every row. A solution that doesn't require typing out every variable name would also be much better, since the real data.table has over 200 columns.
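For what it's worth, the two attempts probably return a single value because, without `by`, `j` is evaluated once on whole columns: `sum()` collapses everything to one number, which `:=` then recycles down the column. Below is a minimal sketch of two vectorized alternatives, not a definitive answer. It assumes `reference[k]` lines up with the k-th column, and the names `cols` and `distance_mat` are mine, introduced only for illustration.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  V1 = c(1,1,0,1,0,0,0,1,0,1,0,1,1,0,1,0),
  V2 = c(0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0),
  V3 = c(0,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0),
  V4 = c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0),
  V5 = c(1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0)
)
reference <- c(1, 1, 0, 1, 0)

# Capture the columns to compare before adding any new column,
# so the new columns are never compared against the reference.
# Assumption: reference[k] corresponds to the k-th of these columns.
cols <- names(dt)
stopifnot(length(cols) == length(reference))

# Variant 1: compare each column with its own reference value, then add
# the resulting logical vectors element-wise to get one count per row.
dt[, distance := Reduce(`+`, Map(function(col, ref) col != ref, .SD, reference)),
   .SDcols = cols]

# Variant 2: the same count via a matrix. sweep() applies `!=` between
# each column (MARGIN = 2) and the matching element of reference.
dt[, distance_mat := rowSums(sweep(as.matrix(.SD), 2, reference, "!=")),
   .SDcols = cols]
```

Neither variant needs the column names typed out, since `.SDcols = cols` picks them up from `names(dt)`. The first avoids converting 200+ columns to a matrix; the second may read more naturally if you already think of the data as a matrix.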