1
votes

I have a tibble of survey data. Each row represents an interview with a "respondent." Some respondents were interviewed once; others were interviewed multiple times. I want to select, from this tibble, only the last interview for each respondent.

Here is a minimal example:

tmp <- tribble(
  ~YYYYMM, ~ID, ~DATEPR, ~IDPREV,
   198001,   1,      NA,      NA,
   198001,   2,      NA,      NA,
   198001,   3,      NA,      NA,
   198002,   1,      198001,   1,
   198002,   2,      NA,      NA,
   198002,   3,      NA,      NA,
   198003,   1,      198002,   1,
   198003,   2,      NA,      NA,
   198003,   3,      198002,   3)

where

  • YYYYMM is the date of an interview.

  • DATEPR is the date of the respondent's previous interview, if there was one.

  • ID is unique only within interview waves. This means, for example, that the respondent for whom ID==2 and YYYMM==198001 need not be the respondent for whom ID==2 and YYYMM==198002.

  • IDPREV is the ID of the respondent's previous interview, if there was one.

There are nine rows in the tibble above. But one respondent was interviewed three times, and another was interviewed twice. I want only the last interview for each respondent, so I want a tibble with only six rows. This code does the job:

for (i in 1:nrow(tmp)) {
  if (!is.na(tmp$DATEPR[i])) {
    ind <- which(tmp$YYYYMM == tmp$DATEPR[i] & tmp$ID == tmp$IDPREV[i])
    tmp <- tmp[-ind, ]
  }
}

But it seems a little hard to parse. Is there a clearer way to achieve the same end with the tidyverse functions? I have in mind a two–step function: first, get the indices of all rows to remove; second, remove the rows. But I haven't been able to implement this solution with map or with the dplyr functions.

1

1 Answers

2
votes

If all previously interviewed IDs are listed on the 3rd and 4th columns, you can do a left anti_join of the data frame with itself using dplyr::anti_join, here you match DATEPR and IDPREV with the YYYYMM and ID respectively, only rows from temp whose YYYYMM and ID that don't have a match will be left:

anti_join(tmp, tmp, by = c("YYYYMM" = "DATEPR", "ID" = "IDPREV")) %>% 
    arrange(YYYYMM, ID)

# A tibble: 6 x 4
#  YYYYMM    ID DATEPR IDPREV
#   <dbl> <dbl>  <dbl>  <dbl>
#1 198001     2     NA     NA
#2 198001     3     NA     NA
#3 198002     2     NA     NA
#4 198003     1 198002      1
#5 198003     2     NA     NA
#6 198003     3 198002      3

After running your code:

for (i in 1:nrow(tmp)) {
    if (!is.na(tmp$DATEPR[i])) {
        ind <- which(tmp$YYYYMM == tmp$DATEPR[i] & tmp$ID == tmp$IDPREV[i])
        tmp <- tmp[-ind, ]
    }
}
tmp %>% arrange(YYYYMM, ID)
# A tibble: 6 x 4
#  YYYYMM    ID DATEPR IDPREV
#   <dbl> <dbl>  <dbl>  <dbl>
#1 198001     2     NA     NA
#2 198001     3     NA     NA
#3 198002     2     NA     NA
#4 198003     1 198002      1
#5 198003     2     NA     NA
#6 198003     3 198002      3