Let's start with what rbinom(1, 1, 0.2)
does: it returns 1
instance of 1
independent Bernoulli (that is, 0-1) random variables added together that have a probability of 0.2
of being equal to 1. So, that line will only give outputs 0 (which it will do 80% of the time) or 1 (which it will do the other 20% of the time). As you noted, this isn't what you want.
The issue here is the selection of a random variable. A binomial variable is great for something like, "I roll ten dice. How many land on 6?" because it has the following essential components:
- outcomes dichotomized into success / failure
- a fixed number (ten) of trials
- a consistent probability of success (1/6)
- independent trials (dice do not affect each other)
The situation you're describing doesn't have those features. So, what to do?
Option 1: Go with your instinct for a while()
loop. I'll preface this by saying that while()
loops are discouraged in R for various reasons (chiefly inefficiency). But, since you already understand the concept, let's run with it.
one_patient <- function(){
status <- 1 # 1 = healthy, 0 = ill
years <- (-1) # count how many years completed while healthy
while(status == 1){
years <- years + 1 # this line will run at least one time
status <- rbinom(1, 1, 0.2) # your rbinom(1, 1, 0.2) line makes an appearance!
}
return(years)
}
Now, executing one_patient()
will result in the number of the years the patient successfully transitioned from well to well. This will be at least 0, since years
starts at -1 and is incremented at least one time. It could be very high, if the patient is lucky, though it most likely won't. You can experiment with this by changing the 0.2 parameter to something more optimistic like 0.99 to simulate long life spans.
Option 2: Rethink the random variable. I mentioned above that the variable wasn't binomial; in fact, it's geometric. A situation like, "I roll a die until it lands on 6. How many rolls did it take?" is geometric because it has the following essential components:
- outcomes dichotomized into success / failure
- a consistent probability of success
- repeated trials that terminate when the first success is reached
- independent trials
Much like how binomial variables have useful functions in R such as rbinom(), pbinom(), qbinom(), dbinom()
, there is a corresponding collection for geometric variables: rgeom(), pgeom(), qgeom(), dgeom()
.
To use rgeom()
, we need to be careful about one detail: here, a "success" is characterized as the patient becoming ill, because that's when the experiment ends. (Above, by encoding the patient being well as 1, we're implicitly using the reverse perspective.) This means that the "success" probability is 0.8. rgeom(1, 0.8)
will return the number of draws strictly before the first success, which is equivalent to the number of years the patient went from well to well, as above. Note that the 1 parameter refers to the number of times we want to run this experiment and not something else. Hence:
rgeom(1, 0.8)
will accomplish the same task as the one_patient()
function we defined above. (That is, the distribution of outputs for each will be the same.)
For multiple patients, you can either wrap the one_patient()
function inside replicate()
, or you can just directly adjust the first parameter of rgeom(1, 0.8)
. The second option is much faster, though both are fast if just simulating 1000 patients.
Addendum
Proof that both have the same effect:
sims1 <- replicate(10000, one_patient())
hist(sims1, breaks = seq(-0.5, max(sims1) + 0.5, by = 1))
sims2 <- rgeom(10000, 0.8)
hist(sims2, breaks = seq(-0.5, max(sims2) + 0.5, by = 1))
Proof that rgeom()
is faster:
library(microbenchmark)
microbenchmark(
replicate(10000, one_patient()),
rgeom(10000, 0.8)
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# replicate(10000, one_patient()) 35.4520 38.77585 44.135562 43.82195 46.05920 73.5090 100
# rgeom(10000, 0.8) 1.1978 1.22540 1.273766 1.23640 1.27485 1.9734 100