I have a data frame that looks like the one below. For each ID, I want to randomly assign the subjects to two groups of roughly equal size, and add a new column that indicates which group each subject is in. For example, for ID 1, subjects 101 and 103 are assigned to Group A and 102 and 104 to Group B; for ID 2, 105 and 106 are in Group A and 107 is in Group B. I have thousands of IDs and subjects. How can I do this?

   ID subject
1  1     101
2  1     102
3  1     103
4  1     104
5  2     105
6  2     106
7  2     107

2 Answers

0
votes

For each ID, you can sample from the group labels with replace = TRUE, so that each label has an equal probability of being drawn for every row.

library(dplyr)
groups <- c('Group A', 'Group B')

df %>%
  group_by(ID) %>%
  mutate(group = sample(groups, n(), replace = TRUE)) -> result

Note that the above is completely random, so an ID with 4 rows could end up with 3 rows in Group A and 1 in Group B. If you want the two groups within each ID to be as equal in size as possible, build the labels with rep and shuffle them with sample.

df %>%
  group_by(ID) %>%
  mutate(group = sample(rep(groups, length.out = n()))) -> result
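One detail of the rep approach: with an odd number of rows, rep(groups, length.out = n()) always gives the extra row to the first label. A minimal sketch of one way around that, assuming df holds the data from the question, is to shuffle the labels themselves before repeating them. The seed value and the sizes check are only there for illustration.

```r
library(dplyr)

# Example data from the question (assumed to be what `df` holds)
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2), subject = 101:107)
groups <- c("Group A", "Group B")

set.seed(42)  # arbitrary seed, only to make the sketch reproducible

result <- df %>%
  group_by(ID) %>%
  # shuffle the labels first so the extra subject of an odd-sized ID
  # does not always land in Group A, then shuffle the assignment
  mutate(group = sample(rep(sample(groups), length.out = n()))) %>%
  ungroup()

# Group sizes per ID; within each ID they should differ by at most one
sizes <- count(result, ID, group)
print(sizes)
```
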
0
votes

Using ave to apply a function ID-wise, we can repeat the vector 1:2 up to the length of each ID group with rep_len and then sample the result. To avoid the repeated vector always starting with 1 (and thereby favoring group 1 when the group size is odd), we also sample 1:2 itself before repeating it.

res <- transform(d, g=ave(ID, ID, FUN=function(x) 
  sample(rep_len(sample(1:2), length(x)))))
res
#   ID subject g
# 1  1     101 2
# 2  1     102 1
# 3  1     103 2
# 4  1     104 1
# 5  2     105 1
# 6  2     106 2
# 7  2     107 1
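The question asks for the labels Group A and Group B rather than the codes 1 and 2. A small sketch of one way to convert them, assuming code 1 should map to Group A (the choice is arbitrary, since the codes are assigned at random). The column name group and the seed are illustrative.

```r
# Data from the answer
d <- data.frame(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), subject = 101:107)

set.seed(1)  # arbitrary seed, for reproducibility only
res <- transform(d, g = ave(ID, ID, FUN = function(x)
  sample(rep_len(sample(1:2), length(x)))))

# Map the codes 1/2 to the labels used in the question
res$group <- factor(res$g, levels = 1:2, labels = c("Group A", "Group B"))
res
```
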

Check on a slightly bigger data frame:

d2 <- data.frame(ID=rep(1:10, each=7), subject=1:70)
res2 <- transform(d2, g=ave(ID, ID, FUN=function(x) 
  sample(rep_len(sample(1:2), length(x)))))
with(res2, table(g, ID))
#    ID
# g   1 2 3 4 5 6 7 8 9 10
#   1 4 4 3 4 4 3 4 3 4  3
#   2 3 3 4 3 3 4 3 4 3  4
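With thousands of IDs, some may contain a single subject. In that case sample() receives a single number and, as documented, samples from 1:x instead of returning that number, so the call above can return more values than there are rows. The following is a sketch of a variant that shuffles by index with sample.int to sidestep this. The helper name balanced_codes and the test data d3 are hypothetical, not part of the original answer.

```r
# Hypothetical helper: balanced random codes 1/2 for a group of n subjects
balanced_codes <- function(n) {
  codes <- rep_len(sample(1:2), n)  # random choice of which code gets the extra row
  codes[sample.int(n)]              # shuffle by index; also safe when n == 1
}

set.seed(123)  # arbitrary seed, for reproducibility only

# IDs with 1 to 5 subjects, to cover single-subject and odd-sized groups
d3 <- data.frame(ID = rep(1:5, times = 1:5), subject = 1:15)
res3 <- transform(d3, g = ave(ID, ID, FUN = function(x) balanced_codes(length(x))))

# Largest difference in group sizes within any ID; expected to be at most 1
tab <- with(res3, table(factor(g, levels = 1:2), ID))
max_diff <- max(apply(tab, 2, function(col) abs(col[1] - col[2])))
max_diff
```
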

Data:

d <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), subject = 101:107), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7"))