0
votes

I have three columns in my data set:

  1. The date when Item X was purchased
  2. The date when Item Y was purchased
  3. The date when Item Z was purchased

Sample Data:

 set.seed(100)
 Y   <- sample(seq(as.Date('2016-01-01'), as.Date('2017-09-30'), by="day"), 100)
 Y1  <- sample(seq(as.Date('2016-01-01'), as.Date('2017-03-31'), by="day"), 100)
 Y2  <- sample(seq(as.Date('2016-01-01'), as.Date('2017-03-31'), by="day"), 100)
 X1 <- sample(1:183, 100, replace = T)
 Z1 <- sample(1:183, 100, replace = T)

 X  <- Y1 - X1
 Z  <- Y2 + Z1

 dat <- data.frame("X"=X, "Y"=Y, "Z"=Z)

What I want to do is to create a dataset such that:

  1. If Y is purchased, look back within 6 months and find all those rows with most recent dates where X were purchased. Also, look forward from that date of Y purchase and find all the most recent dates when Z were purchased.
  2. If any of the dates are not available within 6 months range, select as NA

So that sample output would look like:

 2017-04-10 2017-05-08  2017-06-19
 2017-01-01 2017-05-08  2017-09-30
 .
 .
 .
 2017-04-10 2017-06-10  2016-06-19
 NA         2017-05-08  2017-09-30
 .
 .
 .

Here, most recent date means (for example): If bought Y on date June 30, 2017 then from June 30 what was the most recent date when X was bought (lets say it was May 15, 2017) and the most recent date when Z was bought (lets sat it was July 21, 2017). So for X it should be look backward within 6 months max and for Z, ti should be Look Forward into 6 months max.

Logic is easy, but I would prefer a way out in r-SQL or dplyr, if possible.

1
What attempts didn't work (i.e. where's the code that you tried)?hrbrmstr
most recent dates ..._which_ most recent dates? Also, this might be a very ugly query in MySQL, which doesn't support analytic functions.Tim Biegeleisen
most recent means if bought Y on date June 30, 2017 then from June 30 what was the most recent date when X was bought (lets say it was May 15, 2017) and the most recent date when Z was bought (lets sat it was July 21, 2017). So for X it should be look backward within 6 months max and for Z, ti should be Look Forward into 6 months max.I_m_LeMarque
I would suggest a revised model (but still 3 columns): the date, the item purchased, the id that relates items x, y, and z (if, in your previous model, dates are not unique)Strawberry

1 Answers

1
votes
library(dplyr)


# repeat all x and z dates per y
bind_cols(
    select(dat, Y) %>% slice(rep(row_number(), n())),
    select(dat, -Y) %>% slice(rep(row_number(), each = n()))) %>% 
  distinct() %>%
  # calculate date differences and keep if within 6 month bounds
  mutate(XYdiff = as.numeric(difftime(X, Y, units = "days")),
         ZYdiff = as.numeric(difftime(Z, Y, units = "days"))) %>%
  mutate(X = if_else(between(XYdiff, -180, 0), X, as.Date(NA)),
         Z = if_else(between(ZYdiff, 0, 180), Z, as.Date(NA))) %>% 
  # for each y, get the closest dates
  group_by(Y) %>%
  summarise(X = max(X, na.rm = TRUE),
            Z = min(Z, na.rm = TRUE)) %>% 
  # add in any Y with NA for both X and Z
  tidyr::complete(Y, fill = list(X = NA, Y = NA))