0
votes

I have a dataset consisting of id-name pairs. These pairs can appear multiple times. In some rows the value in the name column is missing, in others the id is missing:

id_name = data.table(stringsAsFactors=FALSE,
                 id = c("x123", "xy234","x123",NA,"z123","y123","d654","d654"),
                 name = c("john", "mary", NA,"mary","bob","bob", "john",NA)
                     )

I would like to replace only the NA values in one column with the (non-missing) value from another row, using the other column as lookup value. The result should be:

            id name
      1:  x123 john
      2: xy234 mary
      3:  x123 john
      4: xy234 mary
      5:  z123  bob
      6:  y123  bob
      7:  d654 john
      8:  d654 john

What is the idiomatic way of achieving this, the data.table way? The following approach for updating only the name column does not produce an error, but it does not update the dataset either :

id_name[is.na(name)][id_name[!is.na(name)], on="id", name:=i.name]
> id_name
      id name
1:  x123 john
2: xy234 mary
3:  x123 <NA>
4:  <NA> mary
5:  z123  bob
6:  y123  bob
7:  d654 john
8:  d654 <NA>

However, just subset-and-join without update works:

id_name[is.na(name)][id_name[!is.na(name)], on="id",nomatch=0L]

     id name i.name
1: x123 <NA>   john
2: d654 <NA>   john

UPDATE

Updated MRE to make clear that some names can be associated with 2 ids. Both ids have to be kept.

3

3 Answers

1
votes

I think you are looking for:

id_name[is.na(name), name :=
    id_name[!is.na(name)][.SD, on=.(id), x.name]
]
id_name[is.na(id), id := 
    id_name[!is.na(id)][.SD, on=.(name), mult="first", x.id]
]
id_name
1
votes

We can group by 'name', 'id', and replace the NA elements in 'id' by the first non-NA value

id_name[, id := id[!is.na(id)][1], name]
id_name[, name := name[!is.na(name)], id]

-output

id_name
#     id  name
#1:  x123  john
#2: xy234 marry
#3:  x123  john
#4: xy234 marry

Update

We can use fill from tidyr

library(tidyr)
id_name1 <- id_name[, fill(.SD, id), name][,  fill(.SD, name) , id]
id_name1
#      id name
#1:  x123 john
#2:  x123 john
#3:  d654 john
#4:  d654 john
#5: xy234 mary
#6: xy234 mary
#7:  z123  bob
#8:  y123  bob
0
votes

You can use zoo's na.locf to fill the missing values.

library(zoo)
library(data.table)

id_name[, id := na.locf(id), name]
id_name[, name := na.locf(name), id]
id_name

#      id  name
#1:  x123  john
#2: xy234 marry
#3:  x123  john
#4: xy234 marry