In R, use mutate() to create a new column based on conditions by group

Question

Each person has two types of visits, and each visit has a date record. The dataset is built as follows.

p <- c(1,1,1,2,2,2,2,3,3,3,4)
type <- c(15,20,20,15,20,15,20,20,15,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03","2014-04-05"))
d <- data.frame(p,type,date)

So now the dataset looks like this.

> d
   p type       date
1  1   15 2014-02-03
2  1   20 2014-02-04
3  1   20 2014-02-06
4  2   15 2014-01-28
5  2   20 2014-02-03
6  2   15 2014-03-03
7  2   20 2014-03-13
8  3   20 2014-04-03
9  3   15 2014-04-09
10 3   15 2014-12-03

Now, I'd like to create three new columns:

1. An indicator of whether a type 20 visit happens within 7 days after the type 15 visit: 1 if yes, otherwise 0. (For example, for p 2, this value should be 1 in line 4 and 0 in line 6.)
2. The date of the first type 20 visit that happens within 7 days after the type 15 visit. If there is no type 20 visit within 7 days after the type 15 visit, leave it blank. (For example, for p 1, the value should be 2014-02-04 rather than 2014-02-06.)
3. The number of days between the type 15 visit and the type 20 visit that happens within 7 days of it. If there is no type 20 visit within 7 days after the type 15 visit, leave it blank. (For example, the value in line 1 should be 1.)

I'm a complete newbie in R and have basically no idea what to do. I tried a for loop within each group, but it never worked.

group_by(p)%>%
for(i in i:length(date)){
  if(type[i]== 15 && date[i]+7 >= date[i+1:length(date)]){
  indicator = 1
  first_date = 
  days = #Have no idea how to check in this part
} else {
  indicator = 0
  first_date = NA
  days = NA
}

The expected output is as below.

   p type       date ind first_date days
1  1   15 2014-02-03   1 2014-02-04    1 # = 2014-02-04 - 2014-02-03  
2  1   20 2014-02-04  NA       <NA>   NA
3  1   20 2014-02-06  NA       <NA>   NA
4  2   15 2014-01-28   1 2014-02-03    6 # = 2014-02-03 - 2014-01-28   
5  2   20 2014-02-03  NA       <NA>   NA
6  2   15 2014-03-03   0       <NA>   NA # since (2014-03-13 - 2014-03-03) > 7   
7  2   20 2014-03-13  NA       <NA>   NA
8  3   20 2014-04-03  NA       <NA>   NA #I don't care about the value for type 20 lines
9  3   15 2014-04-09   0       <NA>   NA
10 3   15 2014-12-03   0       <NA>   NA

So I came up with a new idea. What if we group the records by p and by type == 15? Then we can subtract dates within each group to get days, and the rest will be easy.

I found one way of doing this:

 d[,group:= cumsum(type ==15)]

However, this only starts a new group when it encounters a new type 15 record and ignores p. How can I add p as another grouping condition?
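One possible way to make the counter restart for each person is data.table's `by` argument. This is only a minimal sketch, assuming the data.table package is installed and that the data is held in a data.table rather than a plain data.frame. The column name `group` comes from the snippet above; the date column is left out because it is not needed for the grouping itself.

```r
library(data.table)

# Same p and type values as in the question
d <- data.table(
  p    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4),
  type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15, 15)
)

# by = p makes cumsum() restart at 0 for every person
d[, group := cumsum(type == 15), by = p]

d$group
# 1 1 1 1 1 2 2 0 1 2 1
```

Note that a type 20 row that comes before a person's first type 15 row (line 8 here) ends up in group 0. With dplyr, the equivalent would be `d %>% group_by(p) %>% mutate(group = cumsum(type == 15))`.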

You should ask one question at a time here. See stackoverflow.com/help/how-to-ask — Frank
@Axeman Thanks! I struggled a little bit with it. Another hint that I know too little about R. sigh — Megan Wenjie Song
@rawr Yes, that is the case for p 2. He has multiple type 15 dates, and we need to check each type 15 visit. Let me try your method. Thanks! Why is R so hard ;( — Megan Wenjie Song

yeedle · Accepted Answer · 2017-03-20T19:32:37

I took a stab at this. There's one caveat, though: my answer assumes that after a type 15 visit occurs, the next visit for that person is a type 20 visit. If that's not the case, e.g. there's another type 15 visit within 7 days, the first type 15 visit is not matched to any type 20 visit (it gets ind = 0), and only the second type 15 visit is matched:

library(dplyr)
library(tidyr)
library(lubridate)

d %>% 
  mutate(rownum = 1:n()) %>%
  spread(type, date, sep="_")  %>% 
  group_by(p) %>%
  mutate(ind = ifelse(lead(type_20) - type_15 <= 7, 1, 0)) %>%
  mutate(ind = ifelse(is.na(ind), 0, ind)) %>%
  mutate(ind = ifelse(is.na(type_15), NA, ind)) %>%
  mutate(first_date = ifelse(ind == 1, lead(type_20), NA)) %>%
  mutate(first_date = as.Date(first_date, origin = lubridate::origin)) %>%
  mutate(days = first_date - type_15) %>%
  gather("type", "date", type_15, type_20) %>% 
  filter(!is.na(date)) %>% 
  arrange(p, date) %>%
  select(p, type, date, ind, first_date, days)

#       p    type       date   ind first_date    days
#   <dbl>   <chr>     <date> <dbl>     <date>  <time>
#1      1 type_15 2014-02-03     1 2014-02-04  1 days
#2      1 type_20 2014-02-04    NA       <NA> NA days
#3      1 type_20 2014-02-06    NA       <NA> NA days
#4      2 type_15 2014-01-28     1 2014-02-03  6 days
#5      2 type_20 2014-02-03    NA       <NA> NA days
#6      2 type_15 2014-03-03     0       <NA> NA days
#7      2 type_20 2014-03-13    NA       <NA> NA days
#8      3 type_20 2014-04-03    NA       <NA> NA days
#9      3 type_15 2014-04-09     0       <NA> NA days
#10     3 type_15 2014-12-03     0       <NA> NA days

Let me try to explain what I'm doing:

First, the type and date columns are spread so that the dates of each type appear in separate columns, type_15 and type_20 (this makes it easier to compare the dates of the two different types). Next come a couple of mutates. The first three apply the condition outlined in item 1 of the question, as follows: if lead(type_20) - type_15 <= 7, there was a type 20 visit within 7 days of a type 15 visit, so we mark that as 1; otherwise we mark it as 0. After this, if ind is NA, we assume no type 20 visit was found, so we also mark it as 0. In the third mutate, we set ind to NA on the lines where type_15 is NA, i.e. the type 20 lines.
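To see what the spread step and the lead() comparison produce, here is a small sketch that stops after those steps. It assumes dplyr and tidyr are installed; `wide`, `next_20`, and `gap` are names made up for this illustration and do not appear in the pipeline above.

```r
library(dplyr)
library(tidyr)

d <- data.frame(
  p    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4),
  type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15, 15),
  date = as.Date(c("2014-02-03", "2014-02-04", "2014-02-06", "2014-01-28",
                   "2014-02-03", "2014-03-03", "2014-03-13", "2014-04-03",
                   "2014-04-09", "2014-12-03", "2014-04-05"))
)

wide <- d %>%
  mutate(rownum = 1:n()) %>%          # unique key so every visit keeps its own row
  spread(type, date, sep = "_") %>%   # creates the columns type_15 and type_20
  group_by(p) %>%
  mutate(next_20 = lead(type_20),     # type_20 value of the next row, same person
         gap     = next_20 - type_15) %>%
  ungroup()

# gap is 1 day for rownum 1, 6 days for rownum 4, 10 days for rownum 6,
# and NA wherever either date is missing
wide
```

Looking at `gap` makes it easier to see why the first mutate yields 1, 0, or NA, and why the NA values then need the two follow-up mutates.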

The next three mutate lines add the columns outlined in 2 and 3 in the question.
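The as.Date() call in the middle of those lines is there because base R's ifelse() drops the Date class and returns plain numbers (days since 1970-01-01). A tiny base R illustration of that behaviour, using made-up vectors `dates` and `keep`:

```r
dates <- as.Date(c("2014-02-04", "2014-02-06"))
keep  <- c(TRUE, FALSE)

# ifelse() keeps the values but loses the Date class
raw <- ifelse(keep, dates, NA)
class(raw)
# "numeric"

# Converting back needs an origin; 1970-01-01 is the epoch R uses for Dates
restored <- as.Date(raw, origin = "1970-01-01")
class(restored)
# "Date"
```

As far as I know, dplyr::if_else() keeps the Date class as long as both branches are Dates, e.g. `if_else(keep, dates, as.Date(NA))`, which would make the conversion step unnecessary.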

Finally, the columns are gathered back up to their previous format, redundant rows are filtered out, the dataframe is arranged by p and date, and the needed columns are selected.

I hope this is clear enough. It might be helpful to run the code line by line, stopping to view the transformed data frame after each line to see how the transformations act on the dataframe.
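If the caveat mentioned at the top is a problem for your data, one possible alternative is to compare each type 15 date against all type 20 dates of the same person instead of only the next row. The following is a rough sketch, not the pipeline above: `first_20_within` is a hypothetical helper written for this illustration, and treating a same-day type 20 visit as a match is my assumption.

```r
library(dplyr)

d <- data.frame(
  p    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4),
  type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15, 15),
  date = as.Date(c("2014-02-03", "2014-02-04", "2014-02-06", "2014-01-28",
                   "2014-02-03", "2014-03-03", "2014-03-13", "2014-04-03",
                   "2014-04-09", "2014-12-03", "2014-04-05"))
)

# Hypothetical helper: for every type 15 date, return the earliest type 20
# date that falls 0 to `window` days later; NA if there is none or if the
# row is not a type 15 visit. Same-day matches are an assumption.
first_20_within <- function(dates, types, window = 7) {
  d20 <- dates[types == 20]
  out <- rep(as.Date(NA), length(dates))
  for (i in seq_along(dates)) {
    if (types[i] != 15) next
    hits <- d20[d20 >= dates[i] & d20 <= dates[i] + window]
    if (length(hits) > 0) out[i] <- min(hits)
  }
  out
}

res <- d %>%
  group_by(p) %>%
  mutate(first_date = first_20_within(date, type)) %>%
  ungroup() %>%
  mutate(
    # 1/0 on type 15 rows, NA on type 20 rows
    ind  = ifelse(type == 15, as.integer(!is.na(first_date)), NA_integer_),
    days = as.numeric(first_date - date)
  )

res
```

On the sample data this gives ind = 1 for lines 1 and 4 (with days 1 and 6) and ind = 0 for the other type 15 lines. The loop is not fast, but it keeps the logic easy to follow.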
