2
votes

I'm trying to create new variables with mutate in dplyr, and I can't understand the error I get. I've tried everything I can think of and have never run into this issue before.

I have a large data set with over a million observations. Only the first 20 observations are shown here.

This is what my data looks like:

data1 <- read.table(header=TRUE, text="IDnr  visit  time    year    end event   survival

    7   1   04/09/06    2006    31/12/06    0   118
    7   2   04/09/06    2007    31/12/07    0   483
    7   3   04/09/06    2008    31/12/08    0   849
    7   4   04/09/06    2009    31/12/09    0   1214
    7   5   04/09/06    2010    31/12/10    0   1579
    7   6   04/09/06    2011    31/12/11    0   1944
    20  1   24/10/03    2003    31/12/03    0   68
    20  2   24/10/03    2004    31/12/04    0   434
    20  3   24/10/03    2005    31/12/05    0   799
    20  4   24/10/03    2006    31/12/06    0   1164
    20  5   24/10/03    2007    31/12/07    0   1529
    20  6   24/10/03    2008    31/12/08    0   1895
    20  7   24/10/03    2009    31/12/09    0   2260
    20  8   24/10/03    2010    31/12/10    0   2625
    20  9   24/10/03    2011    31/12/11    0   2990
    87  1   17/01/06    2006    31/12/06    0   348
    87  2   17/01/06    2007    31/12/07    0   713
    87  3   17/01/06    2008    31/12/08    0   1079
    87  4   17/01/06    2009    31/12/09    0   1444
    87  5   17/01/06    2010    31/12/10    0   1809")

I should mention that the date and time variables do not have this format in my dataset. They are stored as POSIXct with the format "%Y-%m-%d". They were somehow reformatted when I pasted the data into Stack Overflow and applied the code formatting.

Okay, the problem is that I'm trying to create new survival time variables in the same dataset. One is for a Cox regression model with start and stop times (survival is the stop time, and the new start variable should be called survcox).

I'm also trying to fit a Poisson regression where the offset variable (i.e. the survival time variable) should be called survpois. This is the code I'm trying to use:

  data2 <- data1 %>%
  group_by(IDnr) %>%
  mutate(survcox = ifelse(visit==1, 0, lag(survival)),
         year_aar = substr(data1$year, 1,4), first_day = as.POSIXct(paste0(year_aar, "-01-01-")),
         survpois = as.numeric(data1$end - first_day)+1) %>%
      mutate(survpois = ifelse(year_aar > first_day, as.numeric(end - year_aar),
                           survpois)) %>%
  ungroup()

I receive an error in this step!

Error: incompatible size (1345000), expecting 6 (the group size) or 1

I have no idea why I get this error, what it means, or why my code doesn't work.

All the help I can get is appreciated, thanks in advance!

2
year_aar = substr(data1$year, 1,4) seems to be what returns the error. You probably meant year_aar = substr(year, 1,4). It seems that you have more things going on, though. (comment by jazzurro)

2 Answers

1
votes

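It's because you reference the variable as data1$year (and data1$end too). Inside a grouped mutate, data1$year is the whole column of the ungrouped data frame, so its length (1345000 in your error message) doesn't match the size of the current group (6). Use the bare column names year and end instead.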
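To make the size mismatch concrete, here is a minimal sketch on a made-up five-row data frame (toy and its values are hypothetical, not the asker's data). It assumes a reasonably recent dplyr, where the error message is worded differently from the one in the question but is raised for the same reason.

```r
library(dplyr)

# Hypothetical toy data: IDnr 7 has 2 rows, IDnr 20 has 3 rows
toy <- data.frame(IDnr = c(7, 7, 20, 20, 20),
                  year = c(2006, 2007, 2003, 2004, 2005))

# toy$year ignores the grouping: it is always the full column
full_len <- length(toy$year)

# A bare column name inside a grouped verb only sees the current group
per_group <- toy %>%
  group_by(IDnr) %>%
  summarise(n_in_group = length(year))

# Works: the bare name gives one value per row of each group
ok <- toy %>%
  group_by(IDnr) %>%
  mutate(year_aar = substr(year, 1, 4)) %>%
  ungroup()

# Fails: toy$year has 5 values, but the groups have 2 and 3 rows
bad <- tryCatch(
  toy %>% group_by(IDnr) %>% mutate(year_aar = substr(toy$year, 1, 4)),
  error = function(e) "error"
)
```

Here full_len is the length of the whole column, while per_group holds the per-group row counts, which is the mismatch that the "incompatible size" message reports.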

1
votes

I teased apart your code and found a few issues. The first is the one I mentioned in the comment above. The second is the class of end: in the data you provided, end is a factor. If that is also the case in your own data, you need to convert end to a date object. The third is the comparison year_aar > first_day: first_day is a date object, whereas year_aar is character. Given those, I modified your code.

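library(dplyr)

data1 %>%
  group_by(IDnr) %>%
  mutate(survcox = ifelse(visit == 1, 0, lag(survival)),
         # use the bare column names, not data1$..., inside a grouped mutate
         year_aar = substr(year, 1, 4),
         first_day = as.POSIXct(paste0(year_aar, "-01-01")),
         # end is text in dd/mm/yy form here, so parse it before subtracting
         survpois = as.numeric(as.POSIXct(end, format = "%d/%m/%y") - first_day) + 1) %>%
  # first_day is built from year_aar, so this test is always FALSE and survpois is kept
  mutate(survpois = ifelse(as.numeric(year_aar) > as.numeric(format(first_day, "%Y")),
                           as.numeric(as.POSIXct(end, format = "%d/%m/%y") - year_aar),
                           survpois)) %>%
  ungroup()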

Here is part of the output.

#   IDnr visit     time year      end event survival survcox year_aar  first_day survpois
#1     7     1 04/09/06 2006 31/12/06     0      118       0     2006 2006-01-01      365
#2     7     2 04/09/06 2007 31/12/07     0      483     118     2007 2007-01-01      365
#3     7     3 04/09/06 2008 31/12/08     0      849     483     2008 2008-01-01      366
#4     7     4 04/09/06 2009 31/12/09     0     1214     849     2009 2009-01-01      365
#5     7     5 04/09/06 2010 31/12/10     0     1579    1214     2010 2010-01-01      365
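The date conversion behind survpois can also be checked in isolation. The sketch below uses Date rather than the POSIXct in the code above, on the assumption that the time of day does not matter here. The two end values are copied from the question's data, and end_chr is just an illustrative name.

```r
# Two values copied from the question's end column, kept as text
end_chr <- c("31/12/06", "31/12/08")

# Parse dd/mm/yy text into Date objects
end_date <- as.Date(end_chr, format = "%d/%m/%y")

# First day of the same calendar year as each end date
first_day <- as.Date(paste0(format(end_date, "%Y"), "-01-01"))

# Number of days from first_day to end, counting both end points
survpois <- as.numeric(end_date - first_day) + 1
```

For these two rows survpois comes out as 365 and 366, matching rows 1 and 3 of the output above.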