2
votes

I have data with repeat measurements on each subject (id) at a variable number of timepoints. I would like to retain two row for each subject, timepoint == 0 and the timepoint closest to 4. In the case rows with two candidate timepoints equally distant from 4, e.g. (3, 5), I want to chose the lowest (3).

As shown in the 'choice' column of the image below, rows with "x" would not be retained.

 dat <- structure(list(id = c(172507L, 172507L, 172507L, 172525L, 172525L, 
172525L, 172526L, 172526L, 172526L, 172527L, 172527L, 172527L, 
172527L, 172527L), timepoint = c(0L, 2L, 6L, 0L, 4L, 5L, 0L, 
5L, 2L, 2L, 3L, 5L, 6L, 0L)), class = "data.frame", row.names = c(NA, 
-14L))

enter image description here

4
What's the logic for retaining timepoint = 3 for the single instance of id = 172528, but discarding timepoint = 5 for the single instance of 172529, or timepoint = 6 for the single instance of 172530 ?neilfws
I'm also confused. Why discard the single measurements for id = 172529 and id = 172530?Maurits Evers
I believe the data now matches the first two columns of the image, as intended.user25494

4 Answers

6
votes

We could arrange by id and timepoint and for every group select the first occurrence when timepoint == 0 and minimum absolute value between 4 - timepoint. Since we have arranged it by timepoint which.min will select first timepoint with lower value (in case of tie).

library(dplyr)
dat %>%
  arrange(id, timepoint) %>%
  group_by(id) %>%
  slice(c(which.max(timepoint == 0), which.min(abs(4- timepoint))))

#     id timepoint
#   <int>     <int>
#1 172507         0
#2 172507         2
#3 172525         0
#4 172525         4
#5 172526         0
#6 172526         5
#7 172527         0
#8 172527         3
2
votes

Can you do something like this. Arranging by the distance and then the timepoint will put the smallest closest value first. Then you can use the first() function to grab the first value or filter for when the timepoint is zero.

library(tidyverse)

dat %>% 
  mutate(dist = abs(4-timepoint)) %>% 
  arrange(id, dist, timepoint) %>% 
  group_by(id) %>% 
  filter(timepoint %in% c(0, first(timepoint))) %>% 
  ungroup() %>% 
  arrange(id, timepoint)

2
votes

Here's the data.table solution. It relies on the assumption that each ID will have a timepoint of 0. Otherwise, you should use which.max(timepoint == 0). Credit to Ronak Shah for the which.min approach.

Edit: Changed to match(TRUE, timepoint == 0) and fixed an issue in base R approach.

library(data.table)

dt <- as.data.table(dat)

dt[order(timepoint),
 .SD[c(match(TRUE, timepoint == 0), which.min(abs(4- timepoint)))],
 by = id]

For kicks, here's base R:

do.call(rbind, by(dat[order(dat$timepoint), ], dat[order(dat$timepoint), ], function(x) x[c(match(TRUE, x$timepoint == 0), which.min(abs(4-x$timepoint))),]) )
0
votes

Something like this should work:

zeros <- 
  dat %>% 
  filter(timepoint == 0) %>% 
  transmute(id, timepoint)

nonzeros <- 
  dat %>% 
  filter(timepoint != 0) %>% 
  mutate(diff = abs(timepoint - 4)) %>% 
  group_by(id) %>% 
  filter(diff == min(diff)) %>% 
  arrange(timepoint) %>% 
  slice(1) %>% 
  ungroup() %>% 
  transmute(id, timepoint)

df <-
  bind_rows(zeros, nonzeros) %>% 
  arrange(id, timepoint)

There is probably a way to do this in one pipe but I had an easier time visualizing what's going on this way.