assign new grouping variable based on time series interval from other dataframe

Question

I'm a relative novice in R and am struggling with the following. I have one dataframe with a column of CO2 concentrations measured every second and a column with date-time (POSIXct), and a second dataframe with "start" and "stop" date-time. What I would like to do is assign a grouping variable (e.g. ascending numbers) in a new column to the dataframe with the CO2 concentrations based on the start/stop times of the second dataframe.

For example: start = 13:30, stop = 13:33, so all CO2 concentrations measured between the start and stop times get grouping variable '1'.

As there is time between the first row of start/stop times and the second row, there are also many CO2 measurements which should get "NA" as the grouping variable.

Here is a subset of the start/stop data:

times <- structure(list(Start = structure(c(1591266360, 1591266960), class = c("POSIXct",  "POSIXt"), tzone = ""), Stop = structure(c(1591266540, 1591267140 ), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = 1:2, class = "data.frame")

And as the dataframe of the CO2 concentrations is rather large I've put the output in a text file: CO2 dataframe subset.

This is the first time asking a question here (as most of my previous questions were already asked before), so I apologise in advance if things are unclear.

This is similar stackoverflow.com/questions/24480031/… Or stackoverflow.com/questions/62912260/… — Ronak Shah
Ronak Shah, you're right! I was thinking in the wrong direction, which made my search unsuccessful. I did not master the art of searching yet I suppose (searched for hours), or I didn't recognize the solutions as such. — Thomas

Edo · Accepted Answer · 2020-08-13T09:46:13

Based on the link I left you in the comments, here is your solution.

Your data:

times <- structure(list(Start = structure(c(1591266360, 1591266960), class = c("POSIXct",  "POSIXt"), tzone = ""), Stop = structure(c(1591266540, 1591267140 ), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = 1:2, class = "data.frame")
df <- dget("df.text")  # read the dput() output saved as text; same as eval(parse("df.text"))

Solution:

library(dplyr)
library(fuzzyjoin)

# define a group per each row before joining
times <- times %>%
  mutate(group = row_number())


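# fuzzy join: keep every row of df and attach the interval that contains dt, if any
df_grouped <- fuzzy_left_join(
  df, times,
  by = c("dt" = "Start", "dt" = "Stop"),
  match_fun = list(`>=`, `<=`)  # match when dt >= Start and dt <= Stop
)
# rows of df that fall outside every interval get NA in Start, Stop and group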
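If fuzzyjoin is not available, the same matching rule (dt >= Start and dt <= Stop) can be sketched in base R. This is only an illustration: the CO2 data frame from the question is not reproduced here, so the block builds a synthetic stand-in. The column name CO2, the constant value 400, the UTC time zone and the one-minute padding around the intervals are all assumptions.

```r
# Intervals from the question (epoch seconds); UTC is an assumption
times <- data.frame(
  Start = as.POSIXct(c(1591266360, 1591266960), origin = "1970-01-01", tz = "UTC"),
  Stop  = as.POSIXct(c(1591266540, 1591267140), origin = "1970-01-01", tz = "UTC")
)

# Synthetic stand-in for the CO2 data: one reading per second, made-up values
df <- data.frame(
  dt  = seq(times$Start[1] - 60, times$Stop[2] + 60, by = "sec"),
  CO2 = 400
)

# Start with NA everywhere, then write the interval's row number
# into every measurement that falls inside that interval
df$group <- NA_integer_
for (i in seq_len(nrow(times))) {
  inside <- df$dt >= times$Start[i] & df$dt <= times$Stop[i]
  df$group[inside] <- i
}

# Count rows per group, including the unmatched (NA) ones
table(df$group, useNA = "ifany")
```

A loop over the intervals is fine when there are few start/stop rows. It assumes the intervals do not overlap; if they did, later intervals would overwrite earlier ones.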

fuzzyjoin looks like a pretty cool package. It lets you do this kind of join, which dplyr lacks.
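A note for readers on a newer setup: dplyr 1.1.0 and later provide join_by(), whose between() helper expresses this kind of interval match directly. The sketch below assumes dplyr >= 1.1.0 is installed and uses a synthetic stand-in for the CO2 data (the column name CO2, the value 400 and the UTC time zone are assumptions, not taken from the question's file).

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

# Intervals from the question, with a group id per row; UTC is an assumption
times <- data.frame(
  Start = as.POSIXct(c(1591266360, 1591266960), origin = "1970-01-01", tz = "UTC"),
  Stop  = as.POSIXct(c(1591266540, 1591267140), origin = "1970-01-01", tz = "UTC")
)
times$group <- seq_len(nrow(times))

# Synthetic stand-in for the CO2 data: one reading per second, made-up values
df <- data.frame(
  dt  = seq(times$Start[1] - 60, times$Stop[2] + 60, by = "sec"),
  CO2 = 400
)

# Left join: keep every measurement, match it to the interval with
# Start <= dt <= Stop; measurements outside all intervals get NA
grouped <- left_join(df, times, by = join_by(between(dt, Start, Stop)))

head(grouped)
```

The result has the same columns as the fuzzy_left_join() output: dt, CO2, Start, Stop and group.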
