I have a tibble of survey data. Each row represents an interview with a "respondent." Some respondents were interviewed once; others were interviewed multiple times. I want to select, from this tibble, only the last interview for each respondent.
Here is a minimal example:
tmp <- tribble(
~YYYYMM, ~ID, ~DATEPR, ~IDPREV,
198001, 1, NA, NA,
198001, 2, NA, NA,
198001, 3, NA, NA,
198002, 1, 198001, 1,
198002, 2, NA, NA,
198002, 3, NA, NA,
198003, 1, 198002, 1,
198003, 2, NA, NA,
198003, 3, 198002, 3)
where
YYYYMM
is the date of an interview.DATEPR
is the date of the respondent's previous interview, if there was one.ID
is unique only within interview waves. This means, for example, that the respondent for whomID==2
andYYYMM==198001
need not be the respondent for whomID==2
andYYYMM==198002
.IDPREV
is the ID of the respondent's previous interview, if there was one.
There are nine rows in the tibble above. But one respondent was interviewed three times, and another was interviewed twice. I want only the last interview for each respondent, so I want a tibble with only six rows. This code does the job:
for (i in 1:nrow(tmp)) {
if (!is.na(tmp$DATEPR[i])) {
ind <- which(tmp$YYYYMM == tmp$DATEPR[i] & tmp$ID == tmp$IDPREV[i])
tmp <- tmp[-ind, ]
}
}
But it seems a little hard to parse. Is there a clearer way to achieve the same end with the tidyverse functions? I have in mind a two–step function: first, get the indices of all rows to remove; second, remove the rows. But I haven't been able to implement this solution with map
or with the dplyr
functions.