0
votes

Ciao, Here is my replicating example.

    a=c(1,2,3,4,5,6)
    a1=c(15,17,17,16,14,15)
    a2=c(0,0,1,1,1,0)
    b=c(1,0,NA,NA,0,NA)
    c=c(2010,2010,2010,2010,2010,2010)
    d=c(1,1,0,1,0,NA)
    e=c(2012,2012,2012,2012,2012,2012)
    f=c(1,0,0,0,0,NA)
    g=c(2014,2014,2014,2014,2014,2014)
    h=c(1,1,0,1,0,NA)
    i=c(2010,2012,2014,2012,2014,2014)
    mydata = data.frame(a,a1,a2,b,c,d,e,f,g,h,i)
    names(mydata) = c("id","age","gender","drop1","year1","drop2","year2","drop3","year3","drop4","year4")
    mydata2 <- reshape(mydata, direction = "long", varying = list(c("year1","year2","year3","year4"), c("drop1","drop2","drop3","drop4")),v.names = c("year", "drop"), idvar = "X", timevar = "Year", times = c(1:4))
    x1 = mydata2 %>% 
      group_by(id) %>% 
      slice(which(drop==1)[1])
    x2 = mydata2 %>% 
      group_by(id) %>% 
      slice(which(drop==0)[1])

I have data "mydata2" which is tall such that every ID has many rows.

I want to make new data set "x" such that every ID has one row that is based on if they drop or not. The first of drop1 drop2 drop3 drop4 that equals to 1, I want to take the year of that and put that in a variable dropYEAR. If none of drop1 drop2 drop3 drop4 equals to 1 I want to put the last data point in year1 year2 year3 year4 in the variable dropYEAR.

Ultimately every ID should have 1 row and I want to create 2 new columns: didDROP equals to 1 if the ID ever dropped or 0 if the ID did not ever drop. dropYEAR equals to the year of drop if didDROP equals to 1 or equals to the last reported year1 year2 year3 year4 if the ID did not ever drop. I try to do this in dplyr but this gives part of what I want only because it gets rid of ID values that equals to 0.

This is desired output, thank you to @Wimpel This is desired output

2

2 Answers

1
votes

First mydata2 %>% arrange(id) to understand the dataset, then using dplyr first and lastwe can pull the first year where drop==1 and the last year in case of drop never get 1 where drop is not null. Usingcase_when to check didDROP as it has a nice magic in dealing with NAs.

library(dplyr)
mydata2 %>% group_by(id) %>% 
            mutate(dropY=first(year[!is.na(drop) & drop==1]), 
                   dropYEAR=if_else(is.na(dropY), last(year[!is.na(drop)]),dropY)) %>%
            slice(1)


#Update
mydata2 %>% group_by(id) %>% 
            mutate(dropY=first(year[!is.na(drop) & drop==1]), 
                   dropYEAR=if_else(is.na(dropY), last(year),dropY), 
                   didDROP=case_when(any(drop==1) ~ 1, #Return 1 if there is any drop=1 o.w it will return 0
                                     TRUE ~ 0)) %>%
            select(-dropY) %>% slice(1)

# A tibble: 6 x 9
# Groups:   id [6]
       id   age gender  Year  year  drop     X dropYEAR didDROP
    <dbl> <dbl>  <dbl> <int> <dbl> <dbl> <int>    <dbl>   <dbl>
1     1    15      0     1  2010     1     1     2010       1
2     2    17      0     1  2010     0     2     2012       1
3     3    17      1     1  2010    NA     3     2014       0
4     4    16      1     1  2010    NA     4     2012       1
5     5    14      1     1  2010     0     5     2014       0
6     6    15      0     1  2010    NA     6     2014       0

I hope this what you're looking for.

1
votes

You can sort by id, drop and year, conditionally on dropping or not:

library(dplyr)
mydata2 %>%
  mutate(drop=ifelse(is.na(drop),0,drop)) %>%
  arrange(id,-drop,year*(2*drop-1)) %>%
  group_by(id) %>%
  slice(1) %>%
  select(id,age,gender,didDROP=drop,dropYEAR=year)

# A tibble: 6 x 5
# Groups:   id [6]
     id   age gender didDROP dropYEAR
  <dbl> <dbl>  <dbl>   <dbl>    <dbl>
1     1    15      0       1     2010
2     2    17      0       1     2012
3     3    17      1       0     2014
4     4    16      1       1     2012
5     5    14      1       0     2014
6     6    15      0       0     2014