R data.table remove rows where one column is duplicated if another column is NA

Question

Here is an example data.table:

library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

   col1   col2
1:    A     NA
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I want to remove rows where col2 is NA if col1 is duplicated and another row with the same col1 has a non-NA col2. In other words: group by col1, then if a group has more than one row and one of them has NA in col2, remove that row. This would be the result for dt:

   col1   col2
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I tried this:

dt[, list(col2 = ifelse(length(col1>1), col2[!is.na(col2)], col2)), by=col1]

   col1 col2
1:    A  dog
2:    B  cat
3:    C jeep
4:    D   NA

What am I missing? Thank you

Psidom · Accepted Answer · 2017-08-07T23:30:29

The parenthesis is misplaced (maybe a typo); I suppose it should be length(col1) > 1. You also used ifelse on a scalar condition, which does not work as you expect: ifelse returns a result the same length as its test, so only the first element of the vector is picked up. If you want to remove NA values from a group when there are non-NA values, you can use if/else:

dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1]

#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA
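
As a complement to the above, here is a minimal sketch that first shows the ifelse truncation on a scalar test, then filters the original rows by index instead of rebuilding col2. The names first_only, keep and res are made up for illustration, and the .I-based filter is an alternative I am assuming fits the question, not part of the answer's code.

```r
library(data.table)

dt <- data.table(
  col1 = c('A', 'A', 'B', 'C', 'C', 'D'),
  col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA)
)

# ifelse() returns a result the same length as its test, so a
# length-one test keeps only the first element of the chosen branch.
first_only <- ifelse(TRUE, c('jeep', 'porsch'), NA_character_)

# Row filter (illustrative alternative): keep a row if col2 is non-NA,
# or if every col2 in its col1 group is NA, so the group is not lost.
# .I holds the row numbers of dt for the current group.
keep <- dt[, .I[!is.na(col2) | all(is.na(col2))], by = col1]$V1
res <- dt[keep]
```

Because this selects whole rows, any other columns in dt are carried along unchanged. One difference to be aware of: a group with several rows that are all NA keeps every one of them here, whereas the if/else version collapses such a group to a single NA row.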
