Say I have a data.table such as this one (the columns could just as well hold numbers and NAs):

library(data.table)
temp <- data.table(M=c(NA,T,T,F,F,F,NA,NA,F), 
                   P=c(T,T,T,F,F,F,NA,NA,NA), S=c(T,F,NA,T,F,NA,NA,NA,NA))

    M     P     S
   NA  TRUE  TRUE
 TRUE  TRUE FALSE
 TRUE  TRUE    NA
FALSE FALSE  TRUE
FALSE FALSE FALSE
FALSE FALSE    NA
   NA    NA    NA
   NA    NA    NA
FALSE    NA    NA

I want to check whether one variable being NA implies that a second variable is NA in all of those rows as well. The goal is to find out whether some variables are linked to others.

For example, whenever P is NA, S is NA as well.

This code works properly for two single columns:

temp[is.na(P),all(is.na(S))]

gives TRUE

and

temp[is.na(S),all(is.na(P))]

gives FALSE because in the sixth row (and also the third) S is NA but P is not.

Now my question: I would like to generalize this by checking all pairs of columns in my data.table and printing which pairs are "linked".
I'd prefer to print only the results that are TRUE and ignore the FALSE ones, because most pairs in my real data.table won't be linked and I have 550 variables.

I've tried this code:

temp[, lapply(.SD, function(x) temp[is.na(x), 
                 lapply(.SD, function(y)  all(is.na(y)) )]]

I get this error

Error: unexpected ']' in: "temp[, lapply(.SD, function(x) temp[is.na(x), lapply(.SD, function(y) all(is.na(y)) )]]"

I could try with a for loop but I'd prefer the typical data.table syntax. Any suggestion is welcome.

I would also like to know how to refer to two different .SD when you are nesting data.table calls.

2 Answers

5
votes

For pairwise combinations, crossprod is useful here as well.

We only care whether a value is NA or not:

NAtemp = is.na(temp)

Compare the co-existence of NAs:

crossprod(NAtemp)
#  M P S
#M 3 2 2
#P 2 3 3
#S 2 3 5

with the number of NAs per column:

colSums(NAtemp)
#M P S 
#3 3 5

like this (the row variable being NA implies the column variable is NA):

ans = crossprod(NAtemp) == colSums(NAtemp)
ans
#      M     P     S
#M  TRUE FALSE FALSE
#P FALSE  TRUE  TRUE
#S FALSE FALSE  TRUE

Then use the convenient as.data.frame.table method to format the result (Var1 is the row variable, Var2 the column variable):

subset(as.data.frame(as.table(ans)), Var1 != Var2)
#  Var1 Var2  Freq
#2    P    M FALSE
#3    S    M FALSE
#4    M    P FALSE
#6    S    P FALSE
#7    M    S FALSE
#8    P    S  TRUE
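
Since the question asks to print only the TRUE pairs, here is a minimal sketch of one way to do that with which(..., arr.ind = TRUE). It rebuilds the example as a plain data.frame so it runs without any packages; the names linked, if_NA and then_NA are made up for illustration and are not part of the answer above.

```r
# Same example data as in the question, as a plain data.frame (no packages needed).
temp <- data.frame(M = c(NA, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, FALSE),
                   P = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA),
                   S = c(TRUE, FALSE, NA, TRUE, FALSE, NA, NA, NA, NA))

NAtemp <- is.na(temp)                         # logical matrix: TRUE where NA
ans <- crossprod(NAtemp) == colSums(NAtemp)   # ans[i, j]: i is NA => j is NA
diag(ans) <- FALSE                            # drop the trivial self-pairs

idx <- which(ans, arr.ind = TRUE)             # positions of the TRUE cells only

# Hypothetical output layout: one row per linked (ordered) pair.
linked <- data.frame(if_NA   = rownames(ans)[idx[, "row"]],
                     then_NA = colnames(ans)[idx[, "col"]])
print(linked)
#   if_NA then_NA
# 1     P       S
```

One caveat: a column with no NAs at all gives 0 == 0 and would be reported as linked to every other column, so you may want to drop such columns first.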
3
votes

We can try combn, which checks each unordered pair in one direction only:

unlist(combn(names(temp), 2, FUN = function(nm)
  list(setNames(temp[is.na(get(nm[1])), all(is.na(get(nm[2])))], paste(nm, collapse="-"))))) 
#   M-P   M-S   P-S 
# FALSE FALSE  TRUE 

Or, if we need both directions of every pair (all ordered combinations):

d1 <- CJ(names(temp), names(temp))[V1!=V2]
d1[,  .(index=temp[is.na(get(V1)), all(is.na(get(V2)))]) , .(V1, V2)]
#    V1 V2 index
#1:  M  P FALSE
#2:  M  S FALSE
#3:  P  M FALSE
#4:  P  S  TRUE
#5:  S  M FALSE
#6:  S  P FALSE
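
To keep only the linked pairs, as the question asks, the result can be filtered on index. The following is a sketch under a few assumptions: it names the CJ columns explicitly (V1, V2) rather than relying on the defaults, uses temp[[...]] lookups instead of get(), and stores the result in a hypothetical variable res.

```r
library(data.table)

# Example data from the question.
temp <- data.table(M = c(NA, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, FALSE),
                   P = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA),
                   S = c(TRUE, FALSE, NA, TRUE, FALSE, NA, NA, NA, NA))

# All ordered pairs of distinct column names.
d1 <- CJ(V1 = names(temp), V2 = names(temp))[V1 != V2]

# For each pair: is V2 NA in every row where V1 is NA?
res <- d1[, .(index = all(is.na(temp[[V2]][is.na(temp[[V1]])]))), by = .(V1, V2)]

# Keep only the linked pairs.
res[index == TRUE]
```

With 550 variables this evaluates roughly 300,000 pairs one at a time, so the crossprod approach in the other answer may be faster; I have not benchmarked either.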