1
votes

For each person, there are two types of visits and for each visits, there are date records. The dataset looks like below.

p <-c(1,1,1,2,2,2,2,3,3,3,4)
type <- c(15,20,20,15,20,15,20,20,15,15,15)
date <- as.Date.factor(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03","2014-04-05"))
d <- data.frame(p,type,date)

So now the dataset looks like this.

> d
   p type       date
1  1   15 2014-02-03
2  1   20 2014-02-04
3  1   20 2014-02-06
4  2   15 2014-01-28
5  2   20 2014-02-03
6  2   15 2014-03-03
7  2   20 2014-03-13
8  3   20 2014-04-03
9  3   15 2014-04-09
10 3   15 2014-12-03

Now, I'd like to create three new columns.

  1. indicating whether a type 20 visit happens in 7 days after the type 15 visit, if yes then the indicator is 1, otherwise 0.(for example, for p2, in the line 4, this value should be 1, and in the line 6, this value should be 0)

  2. What is the first date of type 20 visit happened in 7 days after the type 15 visit. If there is no type 20 visit in 7 days after the type 15, then keep it blank. (for example, for p1, the value should be 2014-02-04 instead of 2014-02-06)

  3. How many days is between the type 15 visit and type 20 visit happened in 7 days from it. If there is no type 20 visit in 7 days after the type 15, then keep it blank.(for example, the value in line 1 should be 1)

I'm a super newbie in R, and basically have no idea of what to do. I tried a for loop within group, but it never works.

group_by(p)%>%
for(i in i:length(date)){
  *if(type[i]== 15 && date[i]+7 >= date[i+1:length(date)]){
  indicator = 1
  first_date = 
  days =* #Have no idea how to check in this part
} else {
  indicator = 0
  first_date = NA
  days = NA
}

The expected output is as below.

   p type       date ind first_date days
1  1   15 2014-02-03   1 2014-02-04    1 # = 2014-02-04 - 2014-02-03  
2  1   20 2014-02-04  NA       <NA>   NA
3  1   20 2014-02-06  NA       <NA>   NA
4  2   15 2014-01-28   1 2014-02-03    6 # = 2014-02-03 - 2014-01-28   
5  2   20 2014-02-03  NA       <NA>   NA
6  2   15 2014-03-03   0       <NA>   NA # since (2014-03-13 - 2014-03-03) > 7   
7  2   20 2014-03-13  NA       <NA>   NA
8  3   20 2014-04-03  NA       <NA>   NA #I don't care about the value for type 20 lines
9  3   15 2014-04-09   0       <NA>   NA
10 3   15 2014-12-03   0       <NA>   NA

So I come up with a new idea. What if we group records by p and type == 15.Then we can use subtraction within groups as days, and the rest will be easy.

I found one way in doing this:

 d[,group:= cumsum(type ==15)]

However, this will count group when encountering a new type 15 record. How to add p as another grouping condition?

3
You should ask one question at a time here. See stackoverflow.com/help/how-to-askFrank
Just use data.frame(), instead of data.frame(cbind())Axeman
@Frank Question 2 is the hardest.Megan Wenjie Song
@Axeman Thanks! I struggled a little bit about it. Another hint showing that I know to little about R. sighMegan Wenjie Song
@rawr yes, that is the case for p2. He has multiple 15 dates. and we need to check for each type 15 visit. Let me try your method. Thanks! Why R is so hard ; (Megan Wenjie Song

3 Answers

1
votes

I took a stab at this. There's one caveat though: My answer assumes that after a type 15 visit occurs, the next visit within 7 days will be a type_20 visit. If that's not the case, i.e. there's another type 15 visit within 7 days, the first type 15 visit won't be considered, and only the second type 15 visit matters:

library(dplyr)
library(tidyr)
library(lubridate)

d %>% 
  mutate(rownum = 1:n()) %>%
  spread(type, date, sep="_")  %>% 
  group_by(p) %>%
  mutate(ind = ifelse(lead(type_20) - type_15 <= 7, 1, 0)) %>%
  mutate(ind = ifelse(is.na(ind), 0, ind)) %>%
  mutate(ind = ifelse(is.na(type_15), NA, ind)) %>%
  mutate(first_date = ifelse(ind == 1, lead(type_20), NA)) %>%
  mutate(first_date = as.Date(first_date, origin = lubridate::origin)) %>%
  mutate(days = first_date - type_15) %>%
  gather("type", "date", type_15, type_20) %>% 
  filter(!is.na(date)) %>% 
  arrange(p, date) %>%
  select(p, type, date, ind, first_date, days)

#       p    type       date   ind first_date    days
#   <dbl>   <chr>     <date> <dbl>     <date>  <time>
#1      1 type_15 2014-02-03     1 2014-02-04  1 days
#2      1 type_20 2014-02-04    NA       <NA> NA days
#3      1 type_20 2014-02-06    NA       <NA> NA days
#4      2 type_15 2014-01-28     1 2014-02-03  6 days
#5      2 type_20 2014-02-03    NA       <NA> NA days
#6      2 type_15 2014-03-03     0       <NA> NA days
#7      2 type_20 2014-03-13    NA       <NA> NA days
#8      3 type_20 2014-04-03    NA       <NA> NA days
#9      3 type_15 2014-04-09     0       <NA> NA days
#10     3 type_15 2014-12-03     0       <NA> NA days

Let me try to explain what I'm doing:

First the type and date columns are spread so that the type and date appear in separate columns (this makes it easier to compare dates of the two different type). Next, a couple of mutates. The first three apply the conditions outlined in the questions, as follows: if lead(type_20) - type_15 <= 7) that means there was a type 20 visit within 7 days of a type 15 visit, so we mark that as 1, else we mark as 0. After this, if ind is NA, we assume no type 20 visit was found so we also mark it as 0. In the third mutate we mark the type 15 NA lines as NA.

The next three mutate lines add the columns outlined in 2 and 3 in the question.

Finally, the columns are gathered back up to their previous format, redundant rows are filtered out, the dataframe is arranged by p and date, and the needed columns are selected.

I hope this is clear enough. It might be helpful to run the code line by line, stopping to view the transformed data frame after each line to see how the transformations act on the dataframe.

1
votes

If you're willing to use some functions from the purrr package and to use some custom functions, here is another option...

Packages you'll need

library(dplyr)
library(purrr)

Set up data (as per question)

p <-c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date.factor(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(cbind(p,type,date))
d$date = as.Date(date)

Create custom functions that will work with the purrr map_* functions to iterate through your data frame and create ind and first_date.

# Function to manage ind
ind_manager <- function(type, date, dates_20) {
  if (type == 20)
    return (NA_integer_)

  checks <- map_lgl(dates_20, between, date, date + 7)
  return (as.integer(any(checks)))
}

# Function to manage first_date
first_date_manager <- function(ind, date, dates_20) {
  if (is.na(ind) || ind != 1)
    return (NA_character_)

  dates_20 <- dates_20[order(dates_20)]
  as.character(dates_20[which.max(date < dates_20)])
}

Save a vector of dates where type == 20 to be used as comparisons

dates_20 <- d$date[d$type == 20]

The final mutate() call

# mutate() call to create variables
d %>% 
  mutate(
    ind = map2_int(type, date, ind_manager, dates_20),
    first_date = as.Date(map2_chr(ind, date, first_date_manager, dates_20)),
    days = as.integer(first_date - date)
  )
#>    p type       date ind first_date days
#> 1  1   15 2014-02-03   1 2014-02-04    1
#> 2  1   20 2014-02-04  NA       <NA>   NA
#> 3  1   20 2014-02-06  NA       <NA>   NA
#> 4  2   15 2014-01-28   1 2014-02-03    6
#> 5  2   20 2014-02-03  NA       <NA>   NA
#> 6  2   15 2014-03-03   0       <NA>   NA
#> 7  2   20 2014-03-13  NA       <NA>   NA
#> 8  3   20 2014-04-03  NA       <NA>   NA
#> 9  3   15 2014-04-09   0       <NA>   NA
#> 10 3   15 2014-12-03   0       <NA>   NA
1
votes

Here is a base R way. Generally, I prefer to create a function that does your task which can then be repeated on other pieces and debugged on test cases where it doesn't seem to work.

The first step is to define the pieces:

d <- structure(list(p = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                    type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15),
                    date = structure(c(16104, 16105, 16107, 16098, 16104, 16132, 16142, 16163, 16169, 16407), class = "Date")),
               .Names = c("p", "type", "date"),
               row.names = c(NA, -10L), class = "data.frame")

id <- with(d, {
  id <- ave(type, p, FUN = function(x) cumsum(x == 15))
  factor(paste0(p, id), unique(paste0(p, id)))
})

sp <- split(d, id)

So, sp creates a list of data frames to which we will apply a function. Each piece is a single unique p with at most one type == 15 (plus however many type == 20s follow.

The first two pieces are

sp[1:2]

# $`11`
#   p type       date
# 1 1   15 2014-02-03
# 2 1   20 2014-02-04
# 3 1   20 2014-02-06
# 
# $`21`
#   p type       date
# 4 2   15 2014-01-28
# 5 2   20 2014-02-03

And we can apply the function below on each one

first_date(sp[[1]])

#   p type       date ind first_date days
# 1 1   15 2014-02-03   1 2014-02-04    1
# 2 1   20 2014-02-04  NA       <NA>   NA
# 3 1   20 2014-02-06  NA       <NA>   NA

first_date(sp[[2]])

#   p type       date ind first_date days
# 4 2   15 2014-01-28   1 2014-02-03    6
# 5 2   20 2014-02-03  NA       <NA>   NA

Or all at once with a loop

(sp1 <- lapply(sp, first_date))
`rownames<-`(do.call('rbind', sp1), NULL)

#    p type       date ind first_date days
# 1  1   15 2014-02-03   1 2014-02-04    1
# 2  1   20 2014-02-04  NA       <NA>   NA
# 3  1   20 2014-02-06  NA       <NA>   NA
# 4  2   15 2014-01-28   1 2014-02-03    6
# 5  2   20 2014-02-03  NA       <NA>   NA
# 6  2   15 2014-03-03   0       <NA>   NA
# 7  2   20 2014-03-13  NA       <NA>   NA
# 8  3   20 2014-04-03  NA       <NA>   NA
# 9  3   15 2014-04-09   0       <NA>   NA
# 10 3   15 2014-12-03   0       <NA>   NA

You can take advantage of the arguments, like window, or any others you add without changing much of the function, for example, to change the window

(sp2 <- lapply(sp1, first_date, window = 14))
`rownames<-`(do.call('rbind', sp2), NULL)

#    p type       date ind first_date days ind first_date days
# 1  1   15 2014-02-03   1 2014-02-04    1   1 2014-02-04    1
# 2  1   20 2014-02-04  NA       <NA>   NA  NA       <NA>   NA
# 3  1   20 2014-02-06  NA       <NA>   NA  NA       <NA>   NA
# 4  2   15 2014-01-28   1 2014-02-03    6   1 2014-02-03    6
# 5  2   20 2014-02-03  NA       <NA>   NA  NA       <NA>   NA
# 6  2   15 2014-03-03   0       <NA>   NA   1 2014-03-13   10
# 7  2   20 2014-03-13  NA       <NA>   NA  NA       <NA>   NA
# 8  3   20 2014-04-03  NA       <NA>   NA  NA       <NA>   NA
# 9  3   15 2014-04-09   0       <NA>   NA   0       <NA>   NA
# 10 3   15 2014-12-03   0       <NA>   NA   0       <NA>   NA

first_date <- function(data, window = 7) {
  nr <- nrow(data)

  ## check at least one type 15 and > 1 row
  ty15 <- data$type == 15
  dt15 <- data$date[ty15]

  if (!any(ty15) | nr == 1L)
    return(cbind(data, ind = ifelse(any(ty15), 0, NA),
                 first_date = NA, days = NA))

  ## first date vector
  dts <- rep(min(data$date[!ty15]), nr)
  dts[!ty15] <- NA

  ## days from the type 15 date
  days <- as.numeric(data$date[!ty15] - min(dt15))
  days <- c(days, rep(NA, nr - length(days)))

  ## convert to NA if criteria not met
  to_na <- days > window | is.na(dts)
  days[to_na] <- dts[to_na] <- NA

  ## ind vector -- 1 or 0 if type 15, NA otherwise
  ind <- rep(NA, nr)
  ind[ty15] <- as.integer(!is.na(dts[ty15]))

  ## combine
  cbind(data, ind = ind, first_date = dts, days = days)
}