1
votes

I would like to identify in which year individuals in a panel data set are observed and register the information in another variable. Individuals may be observed over more successive years OR with gaps over one or more years whereupon consecutive yearly observations may follow.

ID 1 in the df below, for instance, is observed in 2000 and 2001, while ID 2 is observed in 2000 and 2002, with a gap in 2001.

df = data.table(Year = c(2000, 2000, 2001, 2001, 2002, 2002), ID = c(1,2, 1,3,2,3 ), V1 = rep("", 6))

df

Year | ID | V1

2000 | 1 |

2000 | 2 |

2001 | 1 |

2001 | 3 |

2002 | 2 |

2002 | 3 |

My wished outpout in V1 then contains for each ID a chain of the observed years:

Year | ID | V1

2000 | 1 | 00/01

2000 | 2 | 00/02

2001 | 1 | 00/01

2001 | 3 | 01/02

2002 | 2 | 00/02

2002 | 3 | 01/02

Or better, as the information is not important for each single observations of the ID: the information of observed years only for the first observation of each ID.

Year | ID | V1

2000 | 1 | 00/01

2000 | 2 | 00/02

2001 | 1 |

2001 | 3 | 01/02

2002 | 2 |

2002 | 3 |

Thanks for any hint!

1
It is not clear what you are planning to do with this information. A possibly more useful solution for future computation is to save the years of observation for each ID into a separate named list. object, like this: split(df$Year, df$ID).lmo
My idea was to easily see the frequencies for each period. That is, supposing I got the desired output for V1, by using table(df$V1) (given V1 is like in the second table above) I could easily obtain the numbers of observations for each period for which the IDs where observed. This is the scope for creating V1.Enrico
And also to assign to each ID the information for which points in time or time periods they were observed, in order to easily create sub data sets, differentiating IDs with respect to observed periods.Enrico
For actual computation, I suspect that the named list that I suggest could be more useful than storing pasted values in a single variable. For example, lengths(split(df$Year, df$ID)) or sapply(split(df$Year, df$ID), length) provides observation counts for each ID as a named vector, where the names are the ID values.lmo
Thank you very much, that are really usefull commands for my purpose!!Enrico

1 Answers

0
votes

Here's a way using by and the paste function. We create a summary table and then merge it back to your original table. I've left the full year value (rather than the substring), but that can be easily accomplished if it is needed.

df2 <- stack(by(df, df$ID, function(d) paste(d$Year, collapse = '/')),
             stringsAsFactors = FALSE)
df2$ind <- as.numeric(as.character(df2$ind)) #convert back to numeric
merge(df, df2, by.x = 'ID', by.y = 'ind')

   ID Year V1    values
1:  1 2000    2000/2001
2:  1 2001    2000/2001
3:  2 2000    2000/2002
4:  2 2002    2000/2002
5:  3 2001    2001/2002
6:  3 2002    2001/2002

To get just one value for the first observation for an ID, we can use the dplyr package:

library(dplyr)
merge(df, df2, by.x = 'ID', by.y = 'ind') %>%
    select(-V1) %>%
    group_by(ID) %>%
    mutate(values = ifelse(Year == min(Year), values, ''))

     ID  Year    values
  <dbl> <dbl>     <chr>
1     1  2000 2000/2001
2     2  2000 2000/2002
3     1  2001          
4     3  2001 2001/2002
5     2  2002          
6     3  2002