I am trying to filter a data frame using a vector of datetime objects (POSIXct). The first vector contains the start times and the second vector contains the end times. I want to return only values that fall within one of the datetime pairs, i.e., is between one pair of start and end times.
Here's some sample data:
dat <- structure(list(timestamp = structure(c(1604386800, 1604386801,
1604386802, 1604386803,
1604386804, 1604386805,
1604473200, 1604473201,
1604473202, 1604473203,
1604473204, 1604473205,
1604386800, 1604386801,
1604386802, 1604386803,
1604386804, 1604386805,
1604473200, 1604473201,
1604473202, 1604473203,
1604473204, 1604473205,
1604586800, 1604586801,
1604586802, 1604586803,
1604586804, 1604586805,
1604586800, 1604586801,
1604586802, 1604586803,
1604586804, 1604586805),
class = c("POSIXct", "POSIXt"),
tzone = "UTC"),
process_time = c(0L, 1L, 2L, 3L, 4L,5L, 0L, 1L, 2L,
3L, 4L, 5L, 0L, 1L, 2L, 3L, 4L, 5L,
0L, 1L, 2L,3L, 4L, 5L, 0L, 1L, 2L,
3L, 4L,5L, 0L, 1L, 2L, 3L, 4L,5L),
process_name = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "C", "C", "C", "C",
"C", "C", "A", "A", "A", "A", "A",
"A")),
class = "data.frame", row.names = c(NA, -36L))
And here's what I've got so far:
library(lubridate)
library(dplyr)
start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
"2020-11-05 14:33:21"))
end_time <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
"2020-11-05 14:33:24"))
filtered_dat <- dat %>%
group_by(date(timestamp), process_name) %>%
filter(timestamp >= start_time & timestamp <= end_time)
What I am looking for is the first and last value being filtered out for each date and process type. It would look something like:
timestamp | process_time | process_name
------------------------------------------------------------
2020-11-03 07:00:01 | 1 | A
2020-11-03 07:00:02 | 2 | A
2020-11-03 07:00:03 | 3 | A
2020-11-03 07:00:04 | 4 | A
2020-11-04 07:00:01 | 1 | A
2020-11-04 07:00:02 | 2 | A
2020-11-04 07:00:03 | 3 | A
2020-11-04 07:00:04 | 4 | A
2020-11-03 07:00:01 | 1 | B
2020-11-03 07:00:02 | 2 | B
2020-11-03 07:00:03 | 3 | B
2020-11-03 07:00:04 | 4 | B
... ... ...
#spacing added for clarity
What appears to be happening is the start and end time vectors are being recycled. So it compares the first row of the dataframe to the first items in the start and end time vectors, then the second row is compared to the second items in the start and end time vectors, and so on. Every few rows it does get a match.
What I need to happen is each row of the dataframe is compared to all three datetime pairs before moving onto the next row in the dataframe. It almost seems like purrr
would be the answer here.