7
votes

I'm having trouble with dplyr's sample_n function. I'm trying to randomly extract subsets from a data.frame, but it fails because sample_n only extracts random rows.

Here are some examples that show how to extract random rows from each subset.

sample-rows-of-subgroups-from-dataframe-with-dplyr

selecting-n-random-rows-across-all-levels-of-a-factor-within-a-dataframe

This is not what I want. I want to extract whole groups at random from a data frame, not random rows from each subset.

For example,

    library(dplyr)

    xx <- rep(rep(seq(0, 800, 200), each = 10), times = 2)
    yy <- c(replicate(2, sort(10^runif(10, -1, 0), decreasing = TRUE)),
            replicate(2, sort(10^runif(10, -1, 0), decreasing = TRUE)),
            replicate(2, sort(10^runif(10, -2, 0), decreasing = TRUE)),
            replicate(2, sort(10^runif(10, -3, 0), decreasing = TRUE)),
            replicate(2, sort(10^runif(10, -4, 0), decreasing = TRUE)))
    V <- rep(seq(100, 2500, length.out = 10), times = 2)
    No <- rep(1:10, each = 10)
    df <- data.frame(V, xx, yy, No)

    random <- df %>%
      group_by(No) %>%
      sample_n(5, replace = TRUE)  ## This part is the problem.
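To make the problem concrete, here is a minimal sketch on a small made-up data frame (`toy` and its `value` column are hypothetical, not part of my real data). It suggests that a grouped sample_n draws rows inside every group, so no group is ever dropped:

```r
library(dplyr)

# Hypothetical toy data: 4 groups (No = 1..4) with 3 rows each
toy <- data.frame(No = rep(1:4, each = 3), value = 1:12)

set.seed(1)  # only so the draw is reproducible
rows_per_group <- toy %>%
  group_by(No) %>%
  sample_n(2) %>%   # 2 random rows from EACH group
  ungroup()

# All 4 groups are still present, each cut down to 2 rows
n_distinct(rows_per_group$No)  # 4
nrow(rows_per_group)           # 8
```

What I need is the opposite: fewer groups, but each kept group with all of its rows.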

For instance, how can I randomly extract 3 subsets while keeping all of their rows?

                V  xx           yy No
    1    100.0000   0 0.9877468589  1
    2    366.6667   0 0.6658268649  1
    3    633.3333   0 0.4408336374  1
    4    900.0000   0 0.4136939054  1
    5   1166.6667   0 0.4104986026  1
    6   1433.3333   0 0.3899468530  1
    7   1700.0000   0 0.3042157845  1
    8   1966.6667   0 0.1585948347  1
    9   2233.3333   0 0.1307305044  1
    10  2500.0000   0 0.1079459480  1
    11   100.0000 200 0.7437972385  2
    12   366.6667 200 0.7130753133  2
    13   633.3333 200 0.6000577122  2
    14   900.0000 200 0.5038569759  2
    15  1166.6667 200 0.3740146819  2
    16  1433.3333 200 0.3605675251  2
    17  1700.0000 200 0.1821736571  2
    18  1966.6667 200 0.1542015388  2
    19  2233.3333 200 0.1453810015  2
    20  2500.0000 200 0.1142553452  2
    21   100.0000 400 0.9712414163  3
    22   366.6667 400 0.5420861908  3
    23   633.3333 400 0.4622129942  3
    24   900.0000 400 0.3634606046  3
    25  1166.6667 400 0.3541710297  3
    26  1433.3333 400 0.3451167353  3
    27  1700.0000 400 0.2413016960  3
    28  1966.6667 400 0.2356020402  3
    29  2233.3333 400 0.2054358298  3
    30  2500.0000 400 0.1132074106  3
    31   100.0000 600 0.9220690387  4
    32   366.6667 600 0.8772938566  4
    33   633.3333 600 0.7560569362  4
    34   900.0000 600 0.5395093190  4
    35  1166.6667 600 0.3696490756  4
    36  1433.3333 600 0.1585255169  4
    37  1700.0000 600 0.1425756544  4
    38  1966.6667 600 0.1135199782  4
    39  2233.3333 600 0.1061660399  4
    40  2500.0000 600 0.1052644706  4
    41   100.0000 800 0.6175240054  5
    42   366.6667 800 0.5527556076  5
    43   633.3333 800 0.4339775258  5
    44   900.0000 800 0.2462104866  5
    45  1166.6667 800 0.1955550477  5
    46  1433.3333 800 0.1701907232  5
    47  1700.0000 800 0.0824833313  5
    48  1966.6667 800 0.0483463760  5
    49  2233.3333 800 0.0246629341  5
    50  2500.0000 800 0.0186177562  5
    51   100.0000   0 0.8977179587  6
    52   366.6667   0 0.8087930175  6
    53   633.3333   0 0.5547978713  6
    54   900.0000   0 0.4395436341  6
    55  1166.6667   0 0.2972449261  6
    56  1433.3333   0 0.0925262903  6
    57  1700.0000   0 0.0665688788  6
    58  1966.6667   0 0.0309263319  6
    59  2233.3333   0 0.0238500731  6
    60  2500.0000   0 0.0213679919  6
    61   100.0000 200 0.7777420232  7
    62   366.6667 200 0.2299083233  7
    63   633.3333 200 0.0611370244  7
    64   900.0000 200 0.0228982941  7
    65  1166.6667 200 0.0150085546  7
    66  1433.3333 200 0.0076922035  7
    67  1700.0000 200 0.0066120335  7
    68  1966.6667 200 0.0062052827  7
    69  2233.3333 200 0.0037895910  7
    70  2500.0000 200 0.0011051211  7
    71   100.0000 400 0.3829786486  8
    72   366.6667 400 0.1901274442  8
    73   633.3333 400 0.1775864007  8
    74   900.0000 400 0.0567928196  8
    75  1166.6667 400 0.0414294193  8
    76  1433.3333 400 0.0127875497  8
    77  1700.0000 400 0.0105576089  8
    78  1966.6667 400 0.0051503839  8
    79  2233.3333 400 0.0035216836  8
    80  2500.0000 400 0.0012326419  8
    81   100.0000 600 0.0370072219  9
    82   366.6667 600 0.0297765049  9
    83   633.3333 600 0.0219866835  9
    84   900.0000 600 0.0140510807  9
    85  1166.6667 600 0.0021593963  9
    86  1433.3333 600 0.0018936887  9
    87  1700.0000 600 0.0017860546  9
    88  1966.6667 600 0.0001551491  9
    89  2233.3333 600 0.0001345905  9
    90  2500.0000 600 0.0001048041  9
    91   100.0000 800 0.7343220323 10
    92   366.6667 800 0.1653557177 10
    93   633.3333 800 0.1006331452 10
    94   900.0000 800 0.0083407709 10
    95  1166.6667 800 0.0043037301 10
    96  1433.3333 800 0.0032461136 10
    97  1700.0000 800 0.0015843809 10
    98  1966.6667 800 0.0004819055 10
    99  2233.3333 800 0.0002991639 10
    100 2500.0000 800 0.0001447263 10
1
Does it have to use dplyr? You could easily do that with base R, e.g. df[df$No %in% sample(unique(df$No),5),] - JeremyS
@JeremyS I prefer dplyr because there is some further processing after this subset. In addition, your code doesn't change anything when I run it. - Alexander
@JeremyS's solution is very elegant and provides what you want. You could do something like df[df$No %in% sample(unique(df$No),5),] %>% group_by(No) %>% do_stuff. - Heroka

1 Answer

11
votes

Maybe this is what you are after:

# sample 5 of the distinct values of No
my_groups <-
  df %>%
  select(No) %>%
  distinct() %>%
  sample_n(5)

# join back to the full data, keeping every row of the sampled groups
my_df <-
  left_join(my_groups, df, by = "No")
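
An alternative that may be worth trying is to sample the group ids first and then filter on them. This is only a sketch under assumptions: `toy` is a hypothetical stand-in with the same shape as `df` (10 groups of 10 rows), and `picked` is a name introduced here for illustration.

```r
library(dplyr)

# Hypothetical stand-in for df: 10 groups (No = 1..10) with 10 rows each
toy <- data.frame(No = rep(1:10, each = 10), value = seq_len(100))

set.seed(42)  # only so the draw is reproducible
picked <- sample(unique(toy$No), 3)  # 3 group ids, drawn without replacement

# Keep every row that belongs to one of the sampled groups
my_df <- toy %>%
  filter(No %in% picked)

# Each sampled group should still have all 10 of its rows
my_df %>% count(No)
nrow(my_df)  # 30
```

One difference to be aware of: filter keeps the rows in their original order, whereas the join above should return the groups in the order they were sampled. Either result can be piped on into group_by(No) for further processing.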