0
votes

I have a very large data set (>12,000 rows) that contains two vectors of the start and end date of a measurement (relative growth rate). Each row has an individual, the date of the initial measurement, the date of the final measurement, and then the growth rate associated with that window. I want to count the number of days within each growth measurement that match dates that are in a separate vector. So ideally I would be adding a vector to the data frame that gives me a count of the number of matching days between the dates in the df and the external vector for each individual's growth rate.

Sample of what the data frame looks like, but in actuality it is much, much larger

individual <- c(1:5)
initialdate <- c("1990-03-01", "1991-05-07", "1992-05-02",
                 "1994-03-02", "1995-03-08")
enddate <- c("1991-08-05", "1993-02-17", "1995-06-22", 
             "1995-09-12", "1997-08-18")
RGR <- c(0.25, 0.3, 0.09, 0.48, 0.34)
df <- data.frame(individual, initialdate, enddate, RGR)

DaysToMatch <- c("1991-03-30", "1993-12-18", ... ) 

countDays <- function(df$initialdate, df$enddate, DaysToMatch) { 
    measurementwindow <- seq(as.Date(initialdate), as.Date(enddate), by="days") 
    commonDays <- length(intersect(measurementwindow, DaysToMatch)) 
    return(commonDays) 
} 

Error in seq.Date(as.Date(InitialDate), as.Date(ObsrvnDate), by = "days") : 'from' must be of length 1

I also suspect that I am not indexing the vectors properly to calculate the commonDays for each row of the data frame.

2

2 Answers

0
votes

You could convert DaysToMatch to date and call countDays function using mapply

DaysToMatch <- as.Date(c("1991-03-30", "1993-12-18"))

countDays <- function(initialdate, enddate, DaysToMatch) {
   measurementwindow <- seq(as.Date(initialdate), as.Date(enddate), by="days")
   commonDays <- length(intersect(measurementwindow, DaysToMatch))
   #Or sum(DaysToMatch %in% measurementwindow)
   return(commonDays)
}

df$commonDays <- mapply(countDays, df$initialdate, df$enddate, 
                 MoreArgs = list(DaysToMatch = DaysToMatch))

df
#  individual initialdate    enddate  RGR commonDays
#1          1  1990-03-01 1991-08-05 0.25          1
#2          2  1991-05-07 1993-02-17 0.30          0
#3          3  1992-05-02 1995-06-22 0.09          1
#4          4  1994-03-02 1995-09-12 0.48          0
#5          5  1995-03-08 1997-08-18 0.34          0

However, we can also use dplyr::between to calculate number of values between two boundaries.

library(dplyr)
library(purrr)

df %>%
  mutate_at(vars(ends_with("Date")), as.Date) %>%
  mutate(commonDays = map2_int(initialdate, enddate, 
                      ~sum(dplyr::between(DaysToMatch, .x, .y))))
0
votes

We can use data.table join in an efficient and compact way

library(data.table)
n <-  data.table(date = DaysToMatch)[df, .N, 
      on = .(date > initialdate, date <= enddate), by = .EACHI]$N
df$commonDays <- n
df
#  individual initialdate    enddate  RGR commonDays
#1          1  1990-03-01 1991-08-05 0.25          1
#2          2  1991-05-07 1993-02-17 0.30          0
#3          3  1992-05-02 1995-06-22 0.09          1
#4          4  1994-03-02 1995-09-12 0.48          0
#5          5  1995-03-08 1997-08-18 0.34          0

Or using tidyverse

library(tidyverse)
df %>% 
   rowwise %>% 
   mutate(commonDays = sum(between(DaysToMatch, initialdate, enddate)))
# A tibble: 5 x 5
#  individual initialdate enddate      RGR commonDays
#       <int> <date>      <date>     <dbl>      <int>
#1          1 1990-03-01  1991-08-05  0.25          1
#2          2 1991-05-07  1993-02-17  0.3           0
#3          3 1992-05-02  1995-06-22  0.09          1
#4          4 1994-03-02  1995-09-12  0.48          0
#5          5 1995-03-08  1997-08-18  0.34          0

data

df[2:3] <- lapply(df[2:3], as.Date)
DaysToMatch <- as.Date(c("1991-03-30", "1993-12-18"))