Programming probability of patients being well in R

Question

I'm going to preface this with the fact that I am a complete R novice.
I have the following problem:

Consider a simple model that progresses year-by-year. In year i, let W_i = patient is well, I_i = patient is ill, and D_i = patient is dead. Transitions can be modeled as a set of conditional probabilities.

Let L = number of years that the patient is well.

I have come up with the probability mass function of L to be P(L)=(1-p)(p)^{L-1}.
The given information is that a patient is well in year 1 and given their age and risk factors, P(W_{i+1}|W_{i})=0.2 for all i

The problem is to write a function in R that simulates the trajectory of a single patient and returns the number of years the patient is well.

I thought that this could be programmed in R as a binomial distribution using the rbinom function. For a single patient,

rbinom(1, 1, 0.2)

but I don't think that this would return the number of years that the patient is well. I'm thinking that the rbinom function should be the start, and that it would need to be paired with a way to count the number of years that a patient is well, but I don't know how to do that.

The next problem is to use R to simulate 1000 patient trajectories and find the sample mean of years of wellness. I'm assuming that this would be an extension of the previous part, just replacing the 1 patient with 1000. However I can't quite figure out where to replace the 1 with 1000: n or size

rbinom(n, size, prob)

This is assuming that using rbinom is the correct thing to do in the first place... If I were to do this in another programming language (say Python) I would use a while loop conditional on patient_status=W and starting with L=0 iterate through the loop and add 1 each successful iteration. I'm not sure if R works in the same way.

Aaron Montgomery Aaron Montgomery · Accepted Answer · 2020-02-18T16:08:17

Let's start with what rbinom(1, 1, 0.2) does: it returns 1 instance of 1 independent Bernoulli (that is, 0-1) random variables added together that have a probability of 0.2 of being equal to 1. So, that line will only give outputs 0 (which it will do 80% of the time) or 1 (which it will do the other 20% of the time). As you noted, this isn't what you want.

The issue here is the selection of a random variable. A binomial variable is great for something like, "I roll ten dice. How many land on 6?" because it has the following essential components:

outcomes dichotomized into success / failure
a fixed number (ten) of trials
a consistent probability of success (1/6)
independent trials (dice do not affect each other)

The situation you're describing doesn't have those features. So, what to do?

Option 1: Go with your instinct for a while() loop. I'll preface this by saying that while() loops are discouraged in R for various reasons (chiefly inefficiency). But, since you already understand the concept, let's run with it.

one_patient <- function(){
  status <- 1                       # 1 = healthy, 0 = ill
  years <- (-1)                     # count how many years completed while healthy
  while(status == 1){
    years <- years + 1              # this line will run at least one time
    status <- rbinom(1, 1, 0.2)     # your rbinom(1, 1, 0.2) line makes an appearance!
  }
  return(years)
}

Now, executing one_patient() will result in the number of the years the patient successfully transitioned from well to well. This will be at least 0, since years starts at -1 and is incremented at least one time. It could be very high, if the patient is lucky, though it most likely won't. You can experiment with this by changing the 0.2 parameter to something more optimistic like 0.99 to simulate long life spans.

Option 2: Rethink the random variable. I mentioned above that the variable wasn't binomial; in fact, it's geometric. A situation like, "I roll a die until it lands on 6. How many rolls did it take?" is geometric because it has the following essential components:

outcomes dichotomized into success / failure
a consistent probability of success
repeated trials that terminate when the first success is reached
independent trials

Much like how binomial variables have useful functions in R such as rbinom(), pbinom(), qbinom(), dbinom(), there is a corresponding collection for geometric variables: rgeom(), pgeom(), qgeom(), dgeom().

To use rgeom(), we need to be careful about one detail: here, a "success" is characterized as the patient becoming ill, because that's when the experiment ends. (Above, by encoding the patient being well as 1, we're implicitly using the reverse perspective.) This means that the "success" probability is 0.8. rgeom(1, 0.8) will return the number of draws strictly before the first success, which is equivalent to the number of years the patient went from well to well, as above. Note that the 1 parameter refers to the number of times we want to run this experiment and not something else. Hence:

rgeom(1, 0.8)

will accomplish the same task as the one_patient() function we defined above. (That is, the distribution of outputs for each will be the same.)

For multiple patients, you can either wrap the one_patient() function inside replicate(), or you can just directly adjust the first parameter of rgeom(1, 0.8). The second option is much faster, though both are fast if just simulating 1000 patients.

Addendum

Proof that both have the same effect:

sims1 <- replicate(10000, one_patient())
hist(sims1, breaks = seq(-0.5, max(sims1) + 0.5, by = 1))

sims2 <- rgeom(10000, 0.8)
hist(sims2, breaks = seq(-0.5, max(sims2) + 0.5, by = 1))

Proof that rgeom() is faster:

library(microbenchmark)
microbenchmark(
  replicate(10000, one_patient()),
  rgeom(10000, 0.8)
)
#Unit: milliseconds
#                            expr     min       lq      mean   median       uq     max neval
# replicate(10000, one_patient()) 35.4520 38.77585 44.135562 43.82195 46.05920 73.5090   100
#               rgeom(10000, 0.8)  1.1978  1.22540  1.273766  1.23640  1.27485  1.9734   100

Programming probability of patients being well in R

1 Answers

Addendum