Sample from groups and only maintain unique observations in the data

Question

I want to take a sample per group while making sure that no participant appears twice across the samples (I need this for a between-subjects ANOVA). I have a data frame in which some participants (not all) appear twice, each time in a different group, i.e. Peter can appear in group v1=A and v2=1 but theoretically also in group v1=B and v2=3. A group is defined by the two variables v1 and v2, so there are up to 8 possible groups (the example data below only contains 4 of them: A/1, B/2, A/3 and B/4).

Now, I want to avoid the double appearance of any participant in the data by taking samples per group and randomly dropping one of the two observations of any participant who appears twice, while keeping the samples similarly sized. I wrote the following ugly code to illustrate my problem.

How do I get the last step done, so that no participant appears twice across the samples and I only have unique cases across all samples?

df1 <- data.frame(ID=c("peter","peter","chris","john","george","george","norman","josef","jan","jan","richard","richard","paul","christian","felix","felix","nick","julius","julius","moritz"),
              v1=rep(c("A","B"),10),
              v2=rep(c(1:4),5))

library(dplyr)
df2 <- df1 %>% group_by(v1,v2) %>% sample_n(2)

Answer below is nice. Another approach would be to just randomly permute the data.frame and then filter out duplicates, e.g. df1[sample(1:nrow(df1)), ] %>% filter(!duplicated(ID)) %>% group... — gfgm
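
The idea in the comment above can be sketched as follows. The comment's pipeline is cut off after `group`, so the final `group_by(v1, v2)` plus `slice_sample(n = 2)` step is only an assumption about the intended continuation, and the names `df_unique` and `df_sampled` are made up for illustration. The sketch also assumes dplyr 1.0.0 or later, where `slice_sample()` is available and silently returns fewer rows when a group has fewer than `n`.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  ID = c("peter", "peter", "chris", "john", "george", "george", "norman",
         "josef", "jan", "jan", "richard", "richard", "paul", "christian",
         "felix", "felix", "nick", "julius", "julius", "moritz"),
  v1 = rep(c("A", "B"), 10),
  v2 = rep(1:4, 5)
)

set.seed(42)  # arbitrary seed, only for reproducibility

# Shuffle the rows, then keep the first (now random) row of each participant
df_unique <- df1[sample(nrow(df1)), ] %>%
  filter(!duplicated(ID))

# Assumed continuation of the truncated comment:
# draw up to 2 participants per (v1, v2) group
df_sampled <- df_unique %>%
  group_by(v1, v2) %>%
  slice_sample(n = 2) %>%
  ungroup()
```

Because duplicates are removed before the per-group draw, a group can end up with fewer than two participants left; `slice_sample()` then returns what is available rather than raising an error.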

markus · Accepted Answer · 2018-04-16T10:08:54

You could first take a sample of size 1 per 'ID', then group by 'v1' and 'v2' and take another sample of size 2.

library(dplyr)
set.seed(1)
df2 <- df1 %>% 
 group_by(ID) %>% 
 sample_n(1) %>% 
 group_by(v1, v2) %>% 
 sample_n(2)

df2
#   Groups:   v1, v2 [4]
#   ID      v1       v2
#   <fct>   <fct> <int>
# 1 paul    A         1
# 2 jan     A         1
# 3 norman  A         3
# 4 richard A         3
# 5 george  B         2
# 6 peter   B         2
# 7 moritz  B         4
# 8 felix   B         4
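
To confirm that a result like the one above really contains each participant only once, a quick check along these lines may help. This is a sketch rather than part of the original answer: it swaps `sample_n()` for `slice_sample()`, which supersedes it in dplyr 1.0.0 and later (assumed here). With this example data, the per-ID draw can occasionally leave a group with a single participant, in which case `sample_n(2)` would stop with an error while `slice_sample(n = 2)` returns the one remaining row.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  ID = c("peter", "peter", "chris", "john", "george", "george", "norman",
         "josef", "jan", "jan", "richard", "richard", "paul", "christian",
         "felix", "felix", "nick", "julius", "julius", "moritz"),
  v1 = rep(c("A", "B"), 10),
  v2 = rep(1:4, 5)
)

set.seed(1)
df2 <- df1 %>%
  group_by(ID) %>%
  slice_sample(n = 1) %>%   # keep one random row per participant
  group_by(v1, v2) %>%      # replaces the grouping by ID
  slice_sample(n = 2) %>%   # up to 2 participants per group
  ungroup()

# No participant should appear more than once
stopifnot(anyDuplicated(df2$ID) == 0)

# Inspect the resulting group sizes
count(df2, v1, v2)
```

If equal group sizes are required for the ANOVA, the output of `count()` shows whether any group came up short for a given seed.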
