0
votes

I am struggling days now to find a way to match two different tables (unbalanced in row number) in R with the matching range of dates.

The first data frame represents what someone has declared and the second data.frame represents what actually is.The problem is that an id say A might have kept for a week or more a amount of product in kilos in his/her storage and then appear it one week after.

So i have two data frames with different rows. Both number for rows exceed 500.000.The first has 500.000 and the second 520.000.(so some NAs might occur but this doesn't matter).

I want to seach and match from these 2 different data frames:

  1. the ids to are the same,
  2. the product code and the man code to be the same, and
  3. to check if the date in the second data frame is in the range of a month from the date in the first data frame (for example he/she declares a weight (100 kg) in 2020/1/1 and appears it in 2020/1/15 but the actual weight is 105).

and if all these conditions are true to sum the two weights (from the data frames 1 and 2)

The desired output must be a data frame with the matched id, range of date, the product , the man, the 2 columns weights and the last column to be the values that will be the sum of the weights.

the example data frame are the two data frames in R below :


id = rep("A",10)
date = seq(as.Date("2000/1/1"), by = "day", length.out = 10)
product = c("E1","E1","E2","E3","E3","E4","E1","E1","E5","E5")
man = c("PS","PS","PS","DDR","DDR","FFR","PS","PS","BA","BA")
weight = c(1505,300,259,231,140,150,300,112,203,2200) 
data1 = data.frame(id,date,product,man,weight);data1


id2 = rep("A",10)
date2 = sample(seq(as.Date("2000/1/10"), by = "day", length.out = 10),10)
product2 = c("E3","E3","E1","E1","E1","E1","E5","E4","E5","E2")
man2 = c("DDR","DDR","PS","PS","PS","PS","BA","FFR","BA","PS")
weight2 = c(259,295,260,240,135,145,250,110,2000,210) 
data2 = data.frame(id2,date2,product2,man2,weight2);data2

How can I match them with the ginned date latency condition ?

Any help ?

1

1 Answers

1
votes

not sure if it's what you want but it looks like a rolling join...

Using data.table and lubridate packages, one version could be:

(Here I show two examples one with 30 days forward and one with one month forward)

set.seed(0)

id = rep("A",10)
date = seq(as.Date("2000/1/1"), by = "day", length.out = 10)
product = c("E1","E1","E2","E3","E3","E4","E1","E1","E5","E5")
man = c("PS","PS","PS","DDR","DDR","FFR","PS","PS","BA","BA")
weight = c(1505,300,259,231,140,150,300,112,203,2200) 
data1 = data.frame(id,date,product,man,weight);data1


id2 = rep("A",10)
date2 = sample(seq(as.Date("2000/1/10"), by = "day", length.out = 10),10)
product2 = c("E3","E3","E1","E1","E1","E1","E5","E4","E5","E2")
man2 = c("DDR","DDR","PS","PS","PS","PS","BA","FFR","BA","PS")
weight2 = c(259,295,260,240,135,145,250,110,2000,210) 
data2 = data.frame(id2,date2,product2,man2,weight2);data2

require(data.table)
require(lubridate)

setDT(data1)
setkey(data1)
setDT(data2)
setkey(data2)

data1[, weight_p50 := weight + 50]
data1[, weight_m50 := weight - 50]


data1[, date := as.Date(date)]
data2[, date2 := as.Date(date2)]
data1[, date_plus_one_month := as.Date(ymd(date %m+% months(1)))]
data1[, date_plus_30_days := as.Date(date + days(30))]

data2[, date2_ := date2]
data1[, date_ := date]

res <-
    data1[data2, list(id, product, man, date = date_, date2, weight, weight2, total_weight = weight + weight2),
          on = .(id = id2,
                 product = product2,
                 man = man2, 
                 date_plus_30_days >= date2, 
                 date <= date2_,
                 weight_p50 >= weight2,
                 weight_m50 <= weight2),
          allow.cartesian = TRUE, nomatch = 0
    ][][]
setkey(res, id, product, man, date, date2)
res <- unique(res, by = c("id", "product", "man"))
res[]


which gives:

   id product man       date      date2 weight weight2 total_weight
1:  A      E1  PS 2000-01-02 2000-01-16    300     260          560
2:  A      E2  PS 2000-01-03 2000-01-17    259     210          469
3:  A      E3 DDR 2000-01-04 2000-01-18    231     259          490
4:  A      E4 FFR 2000-01-06 2000-01-19    150     110          260
5:  A      E5  BA 2000-01-09 2000-01-12    203     250          453

EDIT: following OP comments, I added unicity on id x man x product such that it keeps only first dates satisfying the constraints.