1
votes

I would like to match multiple conditions from independent data tables onto my main data table. How can I do this using the data.table package? What would be the most efficient/fastest way?

I have a mock example, with some mock conditions here to illustrate my question:

main_data <- data.frame( pnum = c(1,2,3,4,5,6,7,8,9,10),
                         age = c(24,35,43,34,55,24,36,43,34,54),
                         gender = c("f","m","f","f","m","f","m","f","f","m"))

data_1 <- data.frame( pnum = c(1,4,5,8,9),
                      value_data_1 = c(1, 2, 1, 1, 1),
                      date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01")))

data_2 <- data.frame( pnum = c(1,5,7,8,9),
                      value_data_2 = c(1, 2, 1, 1, 2),
                      date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01")))

I would like to create a new variable in my main_data table called "matching" of those rows that match between data_1 and data_2 under multiple conditions:

  • First, the value of data_1$value_data_1 has to be equal to 1.
  • Second, the value of data_2$value_data_2 also has to be equal to 1.
  • Third, the pnum and the date should match between data_1 and data_2.

When all these conditions are met, I would expect the new output of main_data to look like this:

> main_data
   pnum age gender matching
1     1  24      f        1
2     2  35      m        0
3     3  43      f        0
4     4  34      f        0
5     5  55      m        0
6     6  24      f        0
7     7  36      m        0
8     8  43      f        1
9     9  34      f        0
10   10  54      m        0

Until now, I programmed each condition seperately and created new placeholder tables in between, but this is not very memory efficient. Is there an efficient way to chain all the conditions using the data.tables package specifically?

2

2 Answers

1
votes

Here's one way:

library(data.table)
library(magrittr)
setDT(main_data)
setDT(data_1)
setDT(data_2)

main_data %>%
  data_1[., on = .(pnum == pnum) ] %>%
  data_2[., on = .(pnum == pnum, date == date) ] %>%
  .[, matching := fcoalesce(+(value_data_1 == 1 & value_data_2 == 1), 0L) ] %>%
  .[, .(pnum, age, gender, matching) ]
#     pnum age gender matching
#  1:    1  24      f        1
#  2:    2  35      m        0
#  3:    3  43      f        0
#  4:    4  34      f        0
#  5:    5  55      m        0
#  6:    6  24      f        0
#  7:    7  36      m        0
#  8:    8  43      f        1
#  9:    9  34      f        0
# 10:   10  54      m        0

I used the magrittr package because I find it useful for portraying the flow code. It is not at all required, and the alternative pipeline for data.table for the same code could be:

data_2[
  data_1[main_data, on = .(pnum == pnum) ]
 ,on = .(pnum == pnum, date == date)
][ ,matching := fcoalesce(+(value_data_1 == 1 & value_data_2 == 1), 0L)
  ][ ,.(pnum, age, gender, matching) ]

There are other ways to break it out, including the use of temporary (mid-step) variables. (This is mostly style and personal preference.)

1
votes

You can use something like Reduce(merge, list(...))

library(data.table)

setDT(main_data); setDT(data_1); setDT(data_2)

res <- Reduce(function(x, y) {
  merge(x, y, by = "pnum", all.x = TRUE)
}, list(main_data, data_1[, -"date"], data_2[, -"date"]))[, `:=`(
  matching = 1L - (value_data_1 != 1 | value_data_2 != 1 | is.na(value_data_1) | is.na(value_data_2)), 
  value_data_1 = NULL,
  value_data_2 = NULL
)]

Output

> res[]
    pnum age gender matching
 1:    1  24      f        1
 2:    2  35      m        0
 3:    3  43      f        0
 4:    4  34      f        0
 5:    5  55      m        0
 6:    6  24      f        0
 7:    7  36      m        0
 8:    8  43      f        1
 9:    9  34      f        0
10:   10  54      m        0