3
votes

So I have a dataframe like this

DF1

ID    timestamp              value
1     2018-07-31 20:00:00    45555
1     2018-08-01 06:50:00    333
1     2018-08-01 07:00:00    322
2     2018-08-01 13:00:00    1222
2     2018-08-01 22:00:00    1111
3     2018-08-01 11:00:00    0
3     2018-08-02 08:00:00    22222

And another one like this

DF2

          startTime             endTime   ID    
2018-08-01 06:45:00 2018-08-01 09:49:00   1
2019-12-02 11:01:00 2019-12-02 11:02:00   2
2018-08-02 07:59:00 2018-08-02 08:50:00   2
2018-08-02 07:59:00 2018-08-02 08:50:00   3

So I want to create a new column using mutate which mutates DF2 that tells me if any time in between the startTime and endTime appears in DF1 that matches the ID.

So in DF2 if we have a date like 8/2/2018 and the time span is 7:59AM - 8:50AM , if any timestamp within this interval inclusive (2018-08-02 7:59:00, 2018-08-02 8:00:00, 2018-08-02 8:01:00...) appears in DF1 then a variable in DF2 reads 1, else 0. This is only the case if the ID matches. Does not matter how many times, atleast once is what matters.

Final version of DF2 should look like.

          startTime             endTime   ID   match
2018-08-01 06:45:00 2018-08-01 09:49:00   1    1
2019-12-02 11:01:00 2019-12-02 11:02:00   2    0
2018-08-02 07:59:00 2018-08-02 08:50:00   2    0
2018-08-02 07:59:00 2018-08-02 08:50:00   3    1

Ideally I would like to use BETWEEN() and remain in the tidyverse.

2
Are the ddate time columnsof correct date time classakrun
yup, they match @akrunJohn Thomas
I think this can be better solved in data.table i.e. more efficientlyakrun
DF1 is a postgresql database btw; and both classes are POSIXct btw.John Thomas
@akrun mutate is what makes sense to me but if you see a better alternative I'm open to suggestionsJohn Thomas

2 Answers

2
votes

We could use a data.table non-equi join on the DateTime columns along with 'ID' column, create the match by checking if the number of rows (.N) is greater than 0

library(data.table)
DF2$match <- 0L
setDT(DF2)[DF1,  match := +(.N > 0),on = .(ID, 
        startTime < timestamp, endTime > timestamp), by = .EACHI]
DF2
#             startTime             endTime ID match
#1: 2018-08-01 06:45:00 2018-08-01 09:49:00  1     1
#2: 2019-12-02 11:01:00 2019-12-02 11:02:00  2     0
#3: 2018-08-02 07:59:00 2018-08-02 08:50:00  2     0
#4: 2018-08-02 07:59:00 2018-08-02 08:50:00  3     1

data

DF1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
 timestamp = structure(c(1533085200, 
1533124200, 1533124800, 1533146400, 1533178800, 1533139200, 1533214800
), class = c("POSIXct", "POSIXt"), tzone = ""), value = c(45555L, 
333L, 322L, 1222L, 1111L, 0L, 22222L)), row.names = c(NA, -7L
), class = "data.frame")

DF2 <- structure(list(startTime = structure(c(1533123900, 1575306060, 
1533214740, 1533214740), class = c("POSIXct", "POSIXt"), tzone = ""), 
    endTime = structure(c(1533134940, 1575306120, 1533217800, 
    1533217800), class = c("POSIXct", "POSIXt"), tzone = ""), 
    ID = c(1L, 2L, 2L, 3L)), row.names = c(NA, -4L), class = "data.frame")
1
votes

Per your request for a tidyverse solution, you can also do this by first joining the two dfs by 'ID', then using lubridate's %within% and interval (%--%) functions to find matches:

library(dplyr)
library(lubridate)

df_both <- left_join(DF2, DF1, by = "ID") 

df_both %>%
  group_by(ID, startTime, endTime) %>%
  summarize(match = any(timestamp %within% (startTime %--% endTime)),
            .groups = "drop")
#> # A tibble: 4 x 4
#>      ID startTime           endTime             match
#>   <int> <dttm>              <dttm>              <lgl>
#> 1     1 2018-08-01 06:45:00 2018-08-01 09:49:00 TRUE 
#> 2     2 2018-08-02 07:59:00 2018-08-02 08:50:00 FALSE
#> 3     2 2019-12-02 12:01:00 2019-12-02 12:02:00 FALSE
#> 4     3 2018-08-02 07:59:00 2018-08-02 08:50:00 TRUE

If you prefer 1/0 instead of TRUE/FALSE, you can use as.integer to convert the 'match' column as such.