3
votes

I've been struggling for a few days to solve the this task in R (I'm a former SAS user).

The setting/study - Observational data. Patients with Crohns Disease. Data was collected annually during 2002–2013. - Patients can be included any year and visits may be irregular on a annual basis. - I know the exact day of death for each patient. VARIABLE: DEATH_YEAR - I know the exact day of relapse (the endpoint of interest). VARIABLE: RELAPSE_YEAR

I am interested in the incidence of relapse and I need to calculate the number of relapses each year divided by the number of individuals alive that year. Now the problem is that from inclusion, individuals come irregularly, but I do know if they are actually alive that year and if they have experienced a relapse.

I could solve this if I could create 12 new variables for each patient. Each new variable should be the calendar year and this variable should be set to '1' if the patient is alive that year and has not yet experienced the event.

Thus the problem is that i need to create a 'year-variables' that are set to '1' for each year at inclusion and thereafter, given that the person is not dead, or has experienced the event.

An example: Patient X was included 2005 and died 2009. For him I would need he following variables: '2005', '2006', '2007', '2008' and '2009' set to '1'. Patient Y was included 2005 and experienced event 2007. For him I would need the following variables: '2005', '2006', 2007' set to '1'. (Yes, year of event/death need still be set to '1').

Here is how my data set looks:

data <- read.table(header = TRUE, text = "
patient     visit   first_visit relapse_year     death_year 
1          2003 2003    .   2010    
1          2004 2003    .   2010    
1          2009 2003    .   2010    
2          2002 2002    2006    .   
2          2006 2002    2006    .   
2          2006 2002    2006    .   
2          2008 2002    2006    .   
2          2012 2002    2006    .   
3          2004 2004    .   .   
3          2008 2004    .   .   
3          2008 2004    .   .
")

Here is the DESIRED data set

desired_data <- read.table(header = TRUE, text = "
patient     visit     first_visit   relapse_year    death_year YEAR2002     YEAR2003    YEAR2004    YEAR2005    YEAR2006    YEAR2007    YEAR2008    YEAR2009    YEAR2010    YEAR2011    YEAR2012
1          2003 2003    .   2010    .   1   1   1   1   1   1   1   1   .   .
1          2004 2003    .   2010    .   1   1   1   1   1   1   1   1   .   .
1          2009 2003    .   2010    .   1   1   1   1   1   1   1   1   .   .
2           2002    2002    2006    .   1   1   1   1   1   .   .   .   .   .   .
2          2006 2002    2006    .   1   1   1   1   1   .   .   .   .   .   .
2          2006 2002    2006    .   1   1   1   1   1   .   .   .   .   .   .
2          2008 2002    2006    .   1   1   1   1   1   .   .   .   .   .   .
2          2012 2002    2006    .   1   1   1   1   1   .   .   .   .   .   .
3          2004 2004    .   .   .   .   1   1   1   1   1   1   1   1   1
3          2008 2004    .   .   .   .   1   1   1   1   1   1   1   1   1
3          2008 2004    .   .   .   .   1   1   1   1   1   1   1   1   1
")

I would be extremely grateful for any advice on this! Thanks in advance!

1
Are you going to do a survival analysis after you get your desired data set? Because, if you do, you don't need to restructure your data that way.nograpes
What is the reason you wish to create such a table? Maybe something more succinct could accomplish the same thing?AGS
Hi, thanks for both replies. I am doing survival analysis but I (think) need that setup because I'm calculating absolute estimates (incidence rate per 1000 person years) and relative risk estimates (hazard ratios by means of Cox regression). Now, if I want to calculate the incidence each year (I'm examining temporal trends), then I'll need that setup, I believe. CheersFrank49
i.e., I need the number of persons at risk each year and the number of events that year to calculate incidence rates. Sorry for any confusion..Frank49
If that is all you need, you don't need this data structure.nograpes

1 Answers

2
votes

It's a bit hackish, but this will work. First turn your data into a numeric data frame so that the . turn into NA:

data0<-data.frame(lapply(data,function(x) as.numeric(as.character(x))))
head(data0)
#    patient visit first_visit relapse_year death_year
# 1        1  2003        2003           NA       2010
# 2        1  2004        2003           NA       2010
# 3        1  2009        2003           NA       2010
# 4        2  2002        2002         2006         NA
# 5        2  2006        2002         2006         NA
# 6        2  2006        2002         2006         NA

Then substitute 2012 (or whatever the last year is) for the NA values.

data0[is.na(data0)]<-2012

Now you can use pmin to determine how long until the patient dies/has a relapse/the experiment ends. The last thing to do is use arithmetic on column numbers to create the new dataset:

activeYears<-matrix(0,nrow(data0),11)
colnames(activeYears)<-2002:2012
startYear<-data0$first_visit[row(activeYears)]
endYear<-pmin(data0$relapse_year[row(activeYears)],data0$death_year[row(activeYears)])
colYear<-col(activeYears)+2001
activeYears[]<-startYear<=colYear & endYear>=colYear
activeYears
#      2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
# [1,]    0    1    1    1    1    1    1    1    1    0    0
# [2,]    0    1    1    1    1    1    1    1    1    0    0
# [3,]    0    1    1    1    1    1    1    1    1    0    0
# [4,]    1    1    1    1    1    0    0    0    0    0    0
# [5,]    1    1    1    1    1    0    0    0    0    0    0
# [6,]    1    1    1    1    1    0    0    0    0    0    0
# [7,]    1    1    1    1    1    0    0    0    0    0    0
# [8,]    1    1    1    1    1    0    0    0    0    0    0
# [9,]    0    0    1    1    1    1    1    1    1    1    1
#[10,]    0    0    1    1    1    1    1    1    1    1    1
#[11,]    0    0    1    1    1    1    1    1    1    1    1