R sample probabilities: Default is equal weight; why does specifying equal weights cause different values to be returned?

Question

I have a simple question about the sample function in R. I'm randomly sampling 0s and 1s and summing them. The input vector vn has length 5 and gives the number of draws for each sum; I repeat this for seeds 1 to 10, setting the seed each time so the random numbers are reproducible. The seed works as expected, but I get different matrices of random numbers depending on what I pass as the prob argument. I assumed prob=NULL would be the same as prob=c(0.5,0.5). Why isn't it?

vn<-c(12, 44, 9, 17, 28)

> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T)), simplify=TRUE)}))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    6    7    7    6    6    9    3    6    2     5
[2,]   22   21   20   29   22   24   24   19   25    19
[3,]    4    8    3    5    4    4    4    6    4     2
[4,]    8    4   12    9   11    7    9   10    8     8
[5,]   13    9   11   14   12   14   10   13   11    12

> do.call(cbind, lapply(c(1:10),function(X) {set.seed(X); sapply(vn, function(Y) sum(sample(x=c(0,1),size=Y,replace=T, prob=c(0.5,0.5))), simplify=TRUE)}))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    6    5    5    6    6    3    9    6   10     7
[2,]   22   23   24   15   22   20   20   25   19    25
[3,]    5    1    6    4    5    5    5    3    5     7
[4,]    9   13    5    8    6   10    8    7    9     9
[5,]   15   19   17   14   16   14   18   15   17    16

UPDATE:

I extended the samplings to 100, with an input vector

vn<-seq(0,100,5)

and compared the rowMeans of the output matrices without prob (test1) and with prob=c(0.5,0.5) (test2) against the expected mean. Interestingly, test1 and test2 are off by exactly the same amount but with opposite signs. Why is that? Thanks!

> rowMeans(test1)-seq(0,100,5)/2
 [1]  0.00 -0.07 -0.01 -0.35 -0.07  0.19 -0.07  0.24  0.21  0.46  0.20  0.50 -0.37 -0.35  0.00  0.64 -0.59  0.63 -1.19  0.44 -0.38

> rowMeans(test2)-seq(0,100,5)/2
 [1]  0.00  0.07  0.01  0.35  0.07 -0.19  0.07 -0.24 -0.21 -0.46 -0.20 -0.50  0.37  0.35  0.00 -0.64  0.59 -0.63  1.19 -0.44  0.38

sample uses different C routines for uniform sampling and weighted sampling. Even though you are using equal weights, R will call the weighted sampling routine anyway. - Randy Lai

Randy Lai · Accepted Answer · 2014-04-26T21:48:10

I've turned my comment into an answer. sample uses different C routines for uniform sampling and weighted sampling. Even though you specify equal weights, R still calls the weighted sampling routine. To see this, consider

> set.seed(1)
> sample.int(100)
  [1]  27  37  57  89  20  86  97  62  58   6  19  16  61  34  67  43  88  83
 [19]  32  63  75  17  51  10  21  29   1  28  81  25  87  42  70  13  55  44
 [37]  78   7  45  26  50  39  46  82  30  65   2  84  59  36  24  85  22  12
 [55]   4   5  14  23  73  79  99  47  18  95  60  77  41  53   3  69  11  71
 [73]  35  31  40  49  76   9  38  64  80  66   8  91  33  92 100  54  98  94
 [91]  52  74  68  72  93  15  56  48  90  96
> set.seed(1)
> sample.int(100, prob = rep(1/100, 100))
  [1]  28  39  60  93  21  91  96  67  63   7  22  18  71  41  79  51  74   1
 [19]  38  78  94  20  64  12  29  40   2  42  87  35  50  61  52  17  84  69
 [37]  81  10  73  44  85  65  80  54  49  82   4  46  75  68  43  90  36  23
 [55]   8  11  30  55  66  34  97  26  47  31  70  24  53  86   6  95  32  89
 [73]  27  33  56  98  88  25  77 100  37  62  19  15  76  13  59   5  14   9
 [91]  45   3  83  99  72  58  48  57  92  16

Note that the two sampled sequences differ, even though the seed is the same.
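
The mirrored signs in the question's UPDATE are consistent with the two routines mapping the same uniform draw to opposite elements. The sketch below is a simplified model, not R's actual source: it assumes the unweighted path picks index floor(2 * u) + 1, while the weighted path with equal weights picks the second element when u <= 0.5 and the first otherwise. The names u, uniform_pick and weighted_pick are made up for illustration.

```r
# Simplified model of the two code paths for x = c(0, 1).
# Assumption: both paths consume the same stream of uniform draws u.
set.seed(1)
size <- 12
u <- runif(size)

x <- c(0, 1)

# Assumed unweighted rule: index = floor(n * u) + 1, so u < 0.5 gives 0.
uniform_pick <- x[floor(2 * u) + 1]

# Assumed weighted rule with equal weights: u <= 0.5 gives the second
# element, otherwise the first, so the mapping is flipped.
weighted_pick <- x[ifelse(u <= 0.5, 2, 1)]

# Every draw lands on opposite values, so the two sums add up to size.
sum(uniform_pick) + sum(weighted_pick) == size
```

Under this model each weighted sum equals size minus the unweighted sum. That matches the first column of the matrices in the question (6 + 6 = 12, 22 + 22 = 44, 4 + 5 = 9, 8 + 9 = 17, 13 + 15 = 28), and it would make the deviations from size / 2 equal in magnitude and opposite in sign.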
