1
votes

I have the following data frames

df1

ID    PMT_DATE
100   2015/01/01
100   2015/02/01
100   2015/04/01 
200   2016/01/01
200   2016/02/01

and df2

ID    DATE         STATUS    
100   2014/12/31   A
100   2015/03/15   B
200   2015/12/31   A  
200   2016/06/01   C

I would create a data frame which is df1 with the STATUS column from df2. If the PMT_DATE column in df1 is greater than or equal to the DATE column in df2, the associated status in df2 should be placed in the new data frame. The resulting dataframe should look like this

ID    PMT_DATE     STATUS
100   2015/01/01   A
100   2015/02/01   A
100   2015/04/01   B 
200   2016/01/01   A
200   2016/02/01   A

Normally I would join the two tables, create a new column and perform calculations with mutate and get rid of the columns I don't need anymore, but since there are multiple matches in the ID column in df1 and df2, I can't quite implement that strategy.

EDIT: For multiple matches, I want the latest status. For instance, the last row with ID == 100 would fall under both Status == A and Status == B, but I only want status B. Also, the ID field in both dataframes represents the same thing (i.e. a join by ID is what I want).

I was thinking of something along the lines of

new_df <- df1 %>% rowwise() $>% do() ...

But I am not sure how to fill in the rest to achieve what I need.

2
Shouldn't the third row of df1 have a status of A and B?bouncyball

2 Answers

1
votes

Here is a way using dplyr. The idea is to join the table first and filter out the row with PMT_DATE >= DATE. I think you only want the latest STATUS (STATUS associate with the latest DATE).

library(dplyr)

df1 %>% 
  left_join(df2, by="ID") %>%
  filter(PMT_DATE >= DATE) %>% 
  group_by(ID, PMT_DATE) %>% 
  slice(n()) %>% # get the latest status
  select(-DATE) %>%
  ungroup()

# # A tibble: 5 x 3
#      ID   PMT_DATE STATUS
#   <int>      <chr>  <chr>
# 1   100 2015/01/01      A
# 2   100 2015/02/01      A
# 3   100 2015/04/01      B
# 4   200 2016/01/01      A
# 5   200 2016/02/01      A
2
votes

I'm not sure if rolling joins are available with dplyr yet. This is what I would do with data.table to get the latest STATUS:

library(data.table)
setDT(df2)[setDT(df1), on = .(ID, DATE = PMT_DATE), roll = Inf]
    ID       DATE STATUS
1: 100 2015-01-01      A
2: 100 2015-02-01      A
3: 100 2015-04-01      B
4: 200 2016-01-01      A
5: 200 2016-02-01      A

Data

df1 <- readr::read_table(
  "ID    PMT_DATE
100   2015/01/01
100   2015/02/01
100   2015/04/01 
200   2016/01/01
200   2016/02/01"
)
df2 <- readr::read_table(
  "ID    DATE         STATUS    
100   2014/12/31   A
100   2015/03/15   B
200   2015/12/31   A  
200   2016/06/01   C"
)