46
votes

I have found what I would consider erratic behavior (but for which I hope there is a simple explanation) in R's use of seeds in conjunction with rbinom() when prob=0.5 is used. General idea: To me, if I set the seed, run rbinom() once (i.e. conduct a single random process), despite what value prob is set to, the random seed should change by one increment. Then, if I again set the seed to the same value, and run another random process (such as rbinom() again, but maybe with a different value of prob), the seed should again change to the same value as it did for the previous single random process.

I have found R does exactly this as long as I'm using rbinom() with any prob!=0.5. Here is an example:

Compare seed vector, .Random.seed, for two probabilities other than 0.5:

set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.4)
temp1 <- .Random.seed

set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.3)
temp2 <- .Random.seed

any(temp1!=temp2)
> [1] FALSE

Compare seed vector, .Random.seed, for prob=0.5 vs. prob!=0.5:

set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.5)
temp1 <- .Random.seed

set.seed(234908)
x <- rbinom(n=1,size=60,prob=0.3)
temp2 <- .Random.seed
any(temp1!=temp2)
> [1] TRUE

temp1==temp2
> [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> [8]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
...

I have found this for all comparisions of prob=0.5 against all other probabilities in the set {0.1, 0.2, ..., 0.9}. Similarly, if I compare any values of prob from {0.1, 0.2, ..., 0.9} other than 0.5, the .Random.seed vector is always element-by-element equal. These facts also hold true for either odd or even size within rbinom().

To make it even more strange (I apologize that this is a little convoluted - it's relevant to the way my function is written), when I use probabilities saved as elements in a vector, I have same problem if 0.5 is first element, but not second. Here is the example for this case:

First case: 0.5 is the first probability referenced in the vector

set.seed(234908)
MNAR <- c(0.5,0.3)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp1 <- .Random.seed

set.seed(234908)
MNAR <- c(0.1,0.3)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp2 <- .Random.seed

any(temp1!=temp2)
> [1] TRUE

any(temp1!=temp2)
> [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> [8]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Second case: 0.5 is the second probability referenced in the vector

set.seed(234908)
MNAR <- c(0.3,0.5)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp1 <- .Random.seed

set.seed(234908)
MNAR <- c(0.1,0.3)
x <- rbinom(n=1,size=60,prob=MNAR[1])
y <- rbinom(n=1,size=50,prob=MNAR[2])
temp2 <- .Random.seed

any(temp1!=temp2)
> [1] FALSE

Again, I find that despite the values used for prob and size, this pattern holds. Can anyone explain this mystery to me? It's causing quite a problem because results that should be the same are coming up different because the seed is for some reason used/calculated differently when prob=0.5 but in no other instance.

2
Perhaps a more concise example of this behavior would be set.seed(123);rbinom(1,60,0.5);rbinom(1,60,0.3); set.seed(123);rbinom(1,60,0.2);rbinom(1,60,0.3); set.seed(123);rbinom(1,60,0.4);rbinom(1,60,0.3) ?joran
I would go to svn.r-project.org/R/trunk/src/nmath/rbinom.c , search for unif_rand(), and follow the logic through ...Ben Bolker
An interesting result using @joran's example above is that you get back on your feet if you make prob = 0.2 or prob = 0.4 draw two numbers instead of one. It suggests that prob = 0.5 requires drawing twice as many random numbers than the other probs. That theory also checks out by replacing 60 with 120 in the OP's x <- rbinom(n=1,size=60,prob=0.3) case.flodel
HA! Look at the code Ben pointed to: there are sections for n*p >= 30 and n*p < 30. The former uses two calls to unif_rand(), the latter a single one. Now notice that your example used prob = 0.5 and size = 60, i.e. n*p == 30! Test with size = 59 and the behavior disappears!flodel
@flodel: write that up as an answer!Ben Bolker

2 Answers

40
votes

So let's turn our comments into an answer. Thanks to Ben Bolker for putting us on the right track with a link to the code: https://svn.r-project.org/R/trunk/src/nmath/rbinom.c and the suggestion to track down where unif_rand() is called.

A quick scan and it seems that the code is broken into two sections, delimited by the comments:

/*-------------------------- np = n*p >= 30 : ------------------- */

and

/*---------------------- np = n*p < 30 : ------------------------- */

Inside each of these, the number of calls to unif_rand is not the same (two versus one.)

So for a given size (n), your random seed may end up in a different state depending on the value of prob (p): whether size * prob >= 30 or not.

With that in mind, all the results you got with your examples should now make sense:

# these end up in the same state
rbinom(n=1,size=60,prob=0.4) # => np <  30
rbinom(n=1,size=60,prob=0.3) # => np <  30

# these don't
rbinom(n=1,size=60,prob=0.5) # => np >= 30
rbinom(n=1,size=60,prob=0.3) # => np <  30

# these don't
{rbinom(n=1,size=60,prob=0.5)  # np >= 30
 rbinom(n=1,size=50,prob=0.3)} # np <  30
{rbinom(n=1,size=60,prob=0.1)  # np <  30
 rbinom(n=1,size=50,prob=0.3)} # np <  30

# these do
{rbinom(n=1,size=60,prob=0.3)  # np <  30
 rbinom(n=1,size=50,prob=0.5)} # np <  30
{rbinom(n=1,size=60,prob=0.1)  # np <  30
 rbinom(n=1,size=50,prob=0.3)} # np <  30
15
votes

I'm going to take a contrarian position on this question and claim that the expectations are not appropriate and are not supported by the documentation. The documentation does not make any claim about what side effects (specifically on .Random.seed) can be expected by calling rbinom, or how those side effects may or may not be the same in various cases.

rbinom has three parameters: n, size, and prob. Your expectation is that, for a random seed set before calling rbinom, .Random.seed will be the same after calling rbinom for a given n and any values of size and prob (or maybe any finite values of size and prob). You certainly realize that it would be different for different values of n. rbinom doesn't guarantee that or imply that.

Without knowing the internals of the function, this can't be known; as the other answer showed, the algorithm is different based on the product of size and prob. And the internals may change so these specific details may change.

At least, in this case, the resulting .Random.seed will be the same after every call of rbinom which has the same n, size and prob. I can construct a pathological function for which this is not even true:

seedtweak <- function() {
  if(floor(as.POSIXlt(Sys.time())$sec * 10) %% 2) {
    runif(1)
  }
  invisible(NULL)
}

Basically, this function looks a whether the tenths of the second of the time is odd or even to decided whether or not to draw a random number. Run this function and .Random.seed may or may not change:

rs <- replicate(10, {
  set.seed(123) 
  seedtweak()
  .Random.seed
})
all(apply(rs, 1, function(x) Reduce(`==`, x)))

The best you can (should?) hope for is that a given set of code with all the inputs/parameters the same (including the seed) will always give identical results. Expecting identical results when only most (or only some) of the parameters are the same is not realistic unless all the functions called make those guarantees.