I have a data table
DT <- data.table(col1=c("a", "b", "c", "c", "a"), col2=c("b", "a", "c", "a", "b"), condition=c(TRUE, FALSE, FALSE, TRUE, FALSE))
col1 col2 condition
1: a b TRUE
2: b a FALSE
3: c c FALSE
4: c a TRUE
5: a b FALSE
and would like to remove rows on the following conditions:
- each row for which
condition==TRUE
(rows 1 and 4) - each row that has the same values for col1 and col2 as a row for which the
condition==TRUE
(that is row 5, col1=a, col2=b) - finally each row that has the same values for col1 and col2 for which
condition==TRUE
, but with col1 and col2 switched (that is row 2, col1=b and col2=a)
So only row 3 should stay.
I'm doing this by making a new data table DTcond
with all rows meeting the condition, looping over the values for col1 and col2, and collecting the indices from DT
which will be removed.
DTcond <- DT[condition==TRUE,]
indices <- c()
for (i in 1:nrow(DTcond)) {
n1 <- DTcond[i, col1]
n2 <- DTcond[i, col2]
indices <- c(indices, DT[ ((col1 == n1 & col2 == n2) | (col1==n2 & col2 == n1)), which=T])
}
DT[!indices,]
col1 col2 condition
1: c c FALSE
This works but is terrible slow for large datasets and I guess there must be other ways in data.table to do this without loops or apply. Any suggestions how I could improve this (I'm new to data.table)?