0
votes

I have a data table

DT <- data.table(col1=c("a", "b", "c", "c", "a"), col2=c("b", "a", "c", "a", "b"), condition=c(TRUE, FALSE, FALSE, TRUE, FALSE))

   col1 col2 condition
1:    a    b      TRUE
2:    b    a     FALSE
3:    c    c     FALSE
4:    c    a      TRUE
5:    a    b     FALSE

and would like to remove rows on the following conditions:

  • each row for which condition==TRUE (rows 1 and 4)
  • each row that has the same values for col1 and col2 as a row for which the condition==TRUE (that is row 5, col1=a, col2=b)
  • finally each row that has the same values for col1 and col2 for which condition==TRUE, but with col1 and col2 switched (that is row 2, col1=b and col2=a)

So only row 3 should stay.

I'm doing this by making a new data table DTcond with all rows meeting the condition, looping over the values for col1 and col2, and collecting the indices from DT which will be removed.

DTcond <- DT[condition==TRUE,]
indices <- c()
for (i in 1:nrow(DTcond)) {
    n1 <- DTcond[i, col1]
    n2 <- DTcond[i, col2]
    indices <- c(indices, DT[ ((col1 == n1 & col2 == n2) | (col1==n2 & col2 == n1)), which=T])
}

DT[!indices,]
   col1 col2 condition
1:    c    c     FALSE

This works but is terrible slow for large datasets and I guess there must be other ways in data.table to do this without loops or apply. Any suggestions how I could improve this (I'm new to data.table)?

1

1 Answers

4
votes

You can do an anti join:

mDT = DT[(condition), !"condition"][, rbind(.SD, rev(.SD), use.names = FALSE)]
DT[!mDT, on=names(mDT)]

#    col1 col2 condition
# 1:    c    c     FALSE