1
votes

I am trying to join two datatables using rolling join. I have looked at various answers including here but unfortunately unable to locate one that helps in this case. I am borrowing the same example from the link posted.

my first dataset is a websession data for two users 1 and 2:

user  web_date_time
1   29-Oct-2016 6:10:03 PM
1   29-Oct-2016 7:34:17 PM
1   30-Oct-2016 2:08:03 PM
1   30-Oct-2016 3:55:12 PM
2   31-Oct-2016 11:32:12 AM
2   31-Oct-2016 2:59:56 PM
2   01-Nov-2016 12:49:44 PM

My second time stamp is for purchase:

user  purchase_date_time
    1   29-Oct-2016 6:10:00 PM
    1   29-Oct-2016 6:11:00 PM
    2   31-Oct-2016 11:35:12 AM
    2   31-Oct-2016 2:50:00 PM

My desired output is which web session led to a purchase but with a constraint. The constraint is - the websession should be after the previous purchase. The desired out is as follows (requires for all purchases, an additional column "websession_led_purchase" to be created ):

user  purchase_date_time          websession_led_purchase
    1   29-Oct-2016 6:10:00 PM    NA
    1   29-Oct-2016 6:11:00 PM    29-Oct-2016 6:10:03 PM
    2   31-Oct-2016 11:35:12 AM   31-Oct-2016 11:32:12 AM
    2   31-Oct-2016 2:50:00 PM    NA

The first NA is on account of no websession before that purchase, the second NA is on account of no websession after the previous purchase (and before the purchase) that led to the second purchase for user 2.

I tried using the roll join method of dt2[dt1,roll=Inf], however, I get "31-Oct-2016 11:32:12 AM" for the fourth row in the desired output, which is incorrect.

Let me know your advice.

1
What's basis of getting 29-Oct-2016 6:10:00 PM as first row in desired row? Why it cannot be 29-Oct-2016 6:10:59 PM?MKR
Also, The constraint is - the websession should be after the previous purchase is not true for 2nd row in desired output.MKR
@MKR- sorry for the error, just edited it with the right numbersvivek

1 Answers

0
votes

The rolling joins is behaving as expected.

The document suggests as:

+Inf (or TRUE) rolls the prevailing value in x forward. It is also known as last observation carried forward (LOCF).

That means the last observation can be carried forward and joined with for many records. Exactly the same is happening with 4th row where 2016-10-31 11:32:12 is coped and mapped with even next record (2016-10-31 14:50:00).

A simple way to fix this issue is to match lag value of websession_led_purchase with current row if those two are same then set value in current row as NA. This will ensure data was carried forwards only-once.

library(lubridate)
library(data.table)

setDT(DT1)
setDT(DT2)

DT1[,':='(date_time = dmy_hms(web_date_time), web_date_time = dmy_hms(web_date_time))]
DT2[, ':='(date_time = dmy_hms(purchase_date_time), 
                purchase_date_time = dmy_hms(purchase_date_time)) ]

setkey(DT1, user, date_time)
setkey(DT2, user, date_time)

DT1[DT2, roll= Inf][,.(user, purchase_date_time, 
 websession_led_purchase = as.POSIXct(ifelse(!is.na(shift(web_date_time)) & 
    web_date_time == shift(web_date_time), NA, web_date_time), 
    origin = "1970-01-01"))]

# user  purchase_date_time    websession_led_purchase
# 1:    1 2016-10-29 18:10:00                    <NA>
# 2:    1 2016-10-29 18:11:00     2016-10-29 19:10:03
# 3:    2 2016-10-31 11:35:12     2016-10-31 11:32:12
# 4:    2 2016-10-31 14:50:00                    <NA>