1
votes

I have a dataframe in which the first column contains unique row IDs, and the second column contains values that are often not unique between rows. Below is a simplified example using iris data:

> df <- as.data.frame(iris$Sepal.Length)
> id <- rownames(df)
> df <- cbind(id, df)
> colnames(df) <- c("id", "Sepal.Length")

> nrow(df)
[1] 150

> length(unique(df$id))
[1] 150

> length(unique(df$Sepal.Length))
[1] 35

> head(df,10)
   id Sepal.Length
1   1          5.1
2   2          4.9
3   3          4.7
4   4          4.6
5   5          5.0
6   6          5.4
7   7          4.6
8   8          5.0
9   9          4.4
10 10          4.9

I would like to randomly sample from df$Sepal.Length without replacement so that the rows in the sampled data have unique row ID values.

> set.seed(22)
> df_sample <- df[sample(df$Sepal.Length, 10, replace=FALSE),]

However, replace=FALSE still gives me rows with duplicate IDs:

> duplicated(df_sample$id)
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

Is there a way to sample this data without replacement so that it returns unique rows? I am trying to specifically sample df$Sepal.Length because I would also like to supply a probability vector for this column. Thank you!

2
Apologies if this was unclear -- I am showing a simplified example, but I need to specifically sample df$Sepal.Length because I would eventually like to supply a probability vector for this column. I will update the question to state this more explicitly. - le116
Maybe try df[sample(length(df$Sepal.Length), 10, replace=FALSE),]. If that is not what you need, the problem may need a clearer explanation. - Suren
@Suren Yes, this works, thank you! This is similar to Shree's answer. - le116
Yes, Shree's answer is similar to mine. I wouldn't have commented if he had posted it before me. - Suren

2 Answers

1
votes

Here's one way: sample row positions instead of the Sepal.Length values themselves.

df <- data.frame(id = 1:nrow(iris), Sepal.Length = iris$Sepal.Length)

df_sample <- df[sample(nrow(df), 10, replace = FALSE), ]

duplicated(df_sample$id)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
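
The original attempt returned duplicates because sample(df$Sepal.Length, 10) draws Sepal.Length values (such as 5.1 or 5.4) rather than row positions. When those numbers are used as row indices, R truncates them to integers, so 5.1 and 5.4 both select row 5. Sampling row positions avoids this, and a probability vector can still be passed through the prob argument of sample(). The sketch below assumes weights proportional to Sepal.Length; the vector w is only an illustration and is not taken from the question.

```r
df <- data.frame(id = seq_len(nrow(iris)), Sepal.Length = iris$Sepal.Length)

# Hypothetical weights (an assumption for illustration): one weight per row,
# here proportional to Sepal.Length so longer sepals are more likely to be drawn.
w <- df$Sepal.Length / sum(df$Sepal.Length)

set.seed(22)
# Sample row positions, not values, weighting each row by w.
rows <- sample(nrow(df), 10, replace = FALSE, prob = w)
df_sample <- df[rows, ]

# anyDuplicated() returns 0 when no ID appears more than once.
anyDuplicated(df_sample$id)
```

Because prob has one entry per row, rows that share the same Sepal.Length value can still be given different weights if needed.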
1
votes

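You can use the sample_n() and sample_frac() functions from dplyr to sample rows of a data frame directly, by count or by fraction. Because they work on whole rows, you never index by the column's values, and both accept a weight argument if you need sampling probabilities.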

library(dplyr)

sample_n(iris, 100, replace = FALSE)      # 100 rows
sample_frac(iris, 0.75, replace = FALSE)  # 75% of the rows
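
In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(), which takes the sampling weights through weight_by. A minimal sketch, assuming dplyr >= 1.0.0 is installed and using Sepal.Length itself as the weight purely as an example (the question does not say what the real probability vector is):

```r
library(dplyr)

df <- data.frame(id = seq_len(nrow(iris)), Sepal.Length = iris$Sepal.Length)

set.seed(22)
# weight_by expects one non-negative weight per row; dplyr rescales the
# weights to sum to 1. Sepal.Length is an assumed stand-in for the real weights.
df_sample <- slice_sample(df, n = 10, weight_by = Sepal.Length, replace = FALSE)

# anyDuplicated() returns 0 when no ID appears more than once.
anyDuplicated(df_sample$id)
```

Use prop = 0.75 instead of n = 10 to sample a fraction of the rows, as sample_frac() does.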