Conditional selection of repeated measures from data frame

Question

I have data with repeat measurements on each subject (id) at a variable number of timepoints. I would like to retain two row for each subject, timepoint == 0 and the timepoint closest to 4. In the case rows with two candidate timepoints equally distant from 4, e.g. (3, 5), I want to chose the lowest (3).

As shown in the 'choice' column of the image below, rows with "x" would not be retained.

 dat <- structure(list(id = c(172507L, 172507L, 172507L, 172525L, 172525L, 
172525L, 172526L, 172526L, 172526L, 172527L, 172527L, 172527L, 
172527L, 172527L), timepoint = c(0L, 2L, 6L, 0L, 4L, 5L, 0L, 
5L, 2L, 2L, 3L, 5L, 6L, 0L)), class = "data.frame", row.names = c(NA, 
-14L))

What's the logic for retaining timepoint = 3 for the single instance of id = 172528, but discarding timepoint = 5 for the single instance of 172529, or timepoint = 6 for the single instance of 172530 ? — neilfws
I'm also confused. Why discard the single measurements for id = 172529 and id = 172530? — Maurits Evers
I believe the data now matches the first two columns of the image, as intended. — user25494

Ronak Shah Ronak Shah · Accepted Answer · 2019-05-02T23:37:56

We could arrange by id and timepoint and for every group select the first occurrence when timepoint == 0 and minimum absolute value between 4 - timepoint. Since we have arranged it by timepoint which.min will select first timepoint with lower value (in case of tie).

library(dplyr)
dat %>%
  arrange(id, timepoint) %>%
  group_by(id) %>%
  slice(c(which.max(timepoint == 0), which.min(abs(4- timepoint))))

#     id timepoint
#   <int>     <int>
#1 172507         0
#2 172507         2
#3 172525         0
#4 172525         4
#5 172526         0
#6 172526         5
#7 172527         0
#8 172527         3

Conditional selection of repeated measures from data frame

4 Answers