0
votes

I am trying to filter a data frame using a vector of datetime objects (POSIXct). The first vector contains the start times and the second vector contains the end times. I want to return only values that fall within one of the datetime pairs, i.e., is between one pair of start and end times.

Here's some sample data:

dat <- structure(list(timestamp = structure(c(1604386800, 1604386801, 
                                              1604386802, 1604386803,
                                              1604386804, 1604386805,
                                              1604473200, 1604473201,
                                              1604473202, 1604473203,
                                              1604473204, 1604473205,
                                              1604386800, 1604386801,
                                              1604386802, 1604386803,
                                              1604386804, 1604386805,
                                              1604473200, 1604473201,
                                              1604473202, 1604473203,
                                              1604473204, 1604473205,
                                              1604586800, 1604586801, 
                                              1604586802, 1604586803,
                                              1604586804, 1604586805,
                                              1604586800, 1604586801, 
                                              1604586802, 1604586803,
                                              1604586804, 1604586805),
                                            class = c("POSIXct", "POSIXt"),
                                            tzone = "UTC"),
                      process_time = c(0L, 1L, 2L, 3L, 4L,5L, 0L, 1L, 2L,
                                       3L, 4L, 5L, 0L, 1L, 2L, 3L, 4L, 5L,
                                       0L, 1L, 2L,3L, 4L, 5L, 0L, 1L, 2L,
                                       3L, 4L,5L, 0L, 1L, 2L, 3L, 4L,5L),
                      process_name = c("A", "A", "A", "A", "A", "A", "A",
                                       "A", "A", "A", "A", "A", "B", "B", 
                                       "B", "B", "B", "B", "B", "B", "B",
                                       "B", "B", "B", "C", "C", "C", "C", 
                                       "C", "C", "A", "A", "A", "A", "A",
                                       "A")),
                 class = "data.frame", row.names = c(NA, -36L))

And here's what I've got so far:

library(lubridate)
library(dplyr)

start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                        "2020-11-05 14:33:21"))
end_time <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                      "2020-11-05 14:33:24"))

filtered_dat <- dat %>%
  group_by(date(timestamp), process_name) %>%
  filter(timestamp >= start_time & timestamp <= end_time)

What I am looking for is the first and last value being filtered out for each date and process type. It would look something like:

timestamp             |   process_time    |   process_name
------------------------------------------------------------
2020-11-03 07:00:01   |         1         |        A
2020-11-03 07:00:02   |         2         |        A
2020-11-03 07:00:03   |         3         |        A
2020-11-03 07:00:04   |         4         |        A

2020-11-04 07:00:01   |         1         |        A
2020-11-04 07:00:02   |         2         |        A
2020-11-04 07:00:03   |         3         |        A
2020-11-04 07:00:04   |         4         |        A

2020-11-03 07:00:01   |         1         |        B
2020-11-03 07:00:02   |         2         |        B
2020-11-03 07:00:03   |         3         |        B
2020-11-03 07:00:04   |         4         |        B
       ...                     ...                ...

#spacing added for clarity

What appears to be happening is the start and end time vectors are being recycled. So it compares the first row of the dataframe to the first items in the start and end time vectors, then the second row is compared to the second items in the start and end time vectors, and so on. Every few rows it does get a match.

What I need to happen is each row of the dataframe is compared to all three datetime pairs before moving onto the next row in the dataframe. It almost seems like purrr would be the answer here.

2

2 Answers

1
votes

Reading up on lubridate we could use intervals for this purpose. If there are multiple intervals then the documentation recommends we turn this into a list:

intervals <- list(interval(start_time, end_time))

Suppose we want to check if a certain date is within these intervals we can use the %within% function:

ts <- ymd_hms(2020-11-03 07:00:02)
ts %within% intervals

Results in:

> ts %within% intervals
[1]  TRUE FALSE FALSE

Or if we want to check all intervals at the same time:

> any(ts %within% intervals)
[1] TRUE

Applying this to your data frame:

dat <- dat %>%
  rowwise() %>%
  mutate(within = any(timestamp %within% intervals))

After which you can use simple filtering.

1
votes

You can use map2_lgl from purrr to check if any timestamp in each date is between start_time and end_time.

library(dplyr)
library(lubridate)
library(purrr)

dat %>%
  group_by(date = date(timestamp), process_name) %>%
  filter(any(map2_lgl(start_time, end_time, ~any(between(timestamp, .x, .y)))))