
I have the following problem I'd like to solve in R and apply to a larger workflow. I need to estimate the standard deviation of a gamma distribution where the mean and 95% confidence interval are known.

state = c("group1", "group2", "group3")
mean = c(0.970, 0.694, 0.988)
lowers = c(0.527, 0.381, 0.536)
uppers = c(1.87, 1.37, 1.90)

df = data.frame(state = state, mean = mean, lower = lowers, upper = uppers)

Using Excel and its Solver tool, I can adjust the standard deviation to minimize the sum of squared differences between the distribution's 2.5th and 97.5th percentiles and the target lower and upper values. The challenge is that this needs to be scaled up to a rather large set of data and operationalized in my R data-frame workflow. Any ideas how to solve this?

Do you know the "truth" of these three samples? - r2evans
I'm curious: if you have large amounts of data, how is it easier to come up with the 95% confidence interval than the standard deviation? Variance can be calculated additively, so distributed computation for it is relatively straightforward. Quantiles are more difficult to do perfectly at scale, since they generally require sorted data. - r2evans
I have a large amount of data I need to apply this problem to. I do not have access to the underlying data that went into calculating the mean and 95% CIs. - bsleeter
Okay, that makes sense. What have you been able to accomplish so far, even if you think it is inefficient? My guess is that it might use optim or optimize. - r2evans
Not much in R. I have an Excel workbook that does this, one row at a time, which is manageable but certainly not ideal. Really hoping to get a solution in code that can be part of our regular workflow and be reproducible. I started looking at optim but got lost pretty quickly. - bsleeter

1 Answer


I think this problem is ultimately an optimization problem, dealing with one row of data at a time. Since you want to scale it, though, here's an approximation for finding the distribution's core parameters.
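(For reference, a literal per-row version of that optimization could look something like the sketch below, using stats::optimize over the shape parameter with the scale fixed at mean/shape. I'm not using it for the scaled-up approach that follows, and solve1 is just a name I'm making up here.)

solve1 <- function(mu, lwr, upr, k.max = 50) {
  # squared error between the candidate gamma's 2.5%/97.5% quantiles and the targets
  obj <- function(k) {
    (qgamma(0.025, shape = k, scale = mu / k) - lwr)^2 +
      (qgamma(0.975, shape = k, scale = mu / k) - upr)^2
  }
  k <- optimize(obj, interval = c(1e-6, k.max))$minimum
  c(k = k, scale = mu / k, sd = sqrt(k) * (mu / k))
}
solve1(0.970, 0.527, 1.87)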

The grid process below, by contrast, is not an optimization: it sweeps a defined range of possible k (shape) parameters and, given your mean, finds the shape/scale combination whose quantiles most closely match your upper and lower values. You control the granularity of k, which is as close as you'll get to having a tolerance (the analogue in a true optimization).

As such, this process will be imperfect. I hope it gets you a fast-enough process for good-enough accuracy.

I'm going to first demonstrate a process that operates one row at a time, as guesser1. I'll then expand it to do the same operation on an arbitrary number of mean, lower, and upper values.

Data with Known Answers

But first, I want to generate my own samples so that we have known "truth" for this guesser.

library(dplyr)
library(purrr)   # map2 and map_dbl below come from purrr
set.seed(42)
n <- 4
randks <- tibble(
  k = sample(1:10, size = n, replace = TRUE),
  scale = sample(seq(0.5, 2.5, by = 0.5), size = n, replace = TRUE)
) %>%
  mutate(
    samp = map2(k, scale, ~ rgamma(1000, shape = .x, scale = .y)),
    theor_mu = k*scale,
    mu = map_dbl(samp, ~ mean(.x)),
    lwr = map_dbl(samp, ~ quantile(.x, 0.025)),
    upr = map_dbl(samp, ~ quantile(.x, 0.975))
  ) %>%
  select(-samp)
randks
# # A tibble: 4 x 6
#       k scale theor_mu    mu   lwr   upr
#   <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
# 1    10   2       20   19.9   9.47 33.7 
# 2    10   1.5     15   15.1   7.36 25.0 
# 3     3   2        6    5.85  1.08 14.5 
# 4     9   0.5      4.5  4.51  1.99  7.72

Guesser1

Single row at a time:

guesser1 <- function(mu, lwr, upr, k.max = 10, k.by = 0.01) {
  stopifnot(length(mu) == 1, length(lwr) == 1, length(upr) == 1)
  # candidate shape (k) values; drop the leading 0, since the shape must be positive
  ks <- seq(0, k.max, by = k.by)[-1]
  # given the mean, each k implies scale = mu/k; get that gamma's 2.5% and 97.5% quantiles
  L <- sapply(ks, function(k) qgamma(0.025, shape = k, scale = mu / k))
  U <- sapply(ks, function(k) qgamma(0.975, shape = k, scale = mu / k))
  # distance between the candidate quantiles and the target quantiles
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  ind <- which.min(dists)
  data.frame(
    k     = ks[ind],
    scale = mu / ks[ind],
    dist  = min(dists),
    lwr   = L[ind],
    upr   = U[ind]
  )
}

In action:

out1 <- do.call(rbind, Map(guesser1, randks$mu, randks$lwr, randks$upr))
cbind(subset(randks, select = -theor_mu), out1)
#    k scale    mu  lwr   upr     k scale  dist  lwr   upr
# 1 10   2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10   1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3  3   2.0  5.85 1.08 14.50  2.76 2.120 0.020 1.10 14.50
# 4  9   0.5  4.51 1.99  7.72  9.55 0.472 0.142 2.12  7.79
###  \____ randks __________/  \____ guessed ____________/

There are certainly some differences, underscoring my original assertion that this is imperfect.

Guessers

All rows at once. This is a little more work in the function, since it deals with matrices instead of just vectors. Not a problem; I just wanted to prove it one row at a time before going for the gusto.

guessers <- function(mu, lwr, upr, k.max = 10, k.by = 0.01, include.size = FALSE) {
  stopifnot(length(mu) == length(lwr), length(mu) == length(upr))
  ks <- seq(0, k.max, by = k.by)[-1]
  # 'mu'/'lwr'/'upr' dims: [mu];  'ks' dims: [ks]
  L <- outer(mu, ks, function(m, k) qgamma(0.025, shape = k, scale = m / k))
  U <- outer(mu, ks, function(m, k) qgamma(0.975, shape = k, scale = m / k))
  # 'L' & 'U' dims: [mu, ks]
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  inds <- apply(dists, 1, which.min)
  mindists <- apply(dists, 1, min)
  i <- seq_along(mu)
  out <- data.frame(
    k = ks[inds],
    scale = mu / ks[inds],
    dist = mindists,
    lwr = L[cbind(i, inds)],
    upr = U[cbind(i, inds)]
  )
  if (include.size) {
    message("guessers memory: ",
            object.size(list(ks, L, U, dists, inds, mindists, i, out)))
  }
  out
}

In action:

outs <- guessers(randks$mu, randks$lwr, randks$upr, include.size = TRUE)
# guessers memory: 106400
cbind(subset(randks, select = -theor_mu), outs)
#    k scale    mu  lwr   upr     k scale  dist  lwr   upr
# 1 10   2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10   1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3  3   2.0  5.85 1.08 14.50  2.76 2.120 0.020 1.10 14.50
# 4  9   0.5  4.51 1.99  7.72  9.55 0.472 0.142 2.12  7.79
###  \____ randks __________/  \____ guessed (same) _____/

(I included a memory message in there just to track how much memory this uses as it scales. It's not bad now; that argument should definitely not be used in production, FYI.)

Comparison

Using microbenchmark, we repeat each operation a few times and compare their run times.

microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr)
)
# Unit: milliseconds
#  expr  min   lq mean median   uq   max neval
#    g1 27.3 28.8 33.9   29.7 33.0 131.1   100
#    gs 13.3 13.6 14.4   13.8 14.6  20.3   100

Not too surprisingly, the all-at-once guessers is a bit faster. When will this not be the case? When the number of rows gets so big that memory consumption becomes a problem; I don't know where that threshold is.
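If you do reach that point, one mitigation (just a sketch, not something I've benchmarked; guessers_chunked is a name I'm making up) is to run guessers on chunks of rows so that only one chunk's L/U matrices are held in memory at a time:

guessers_chunked <- function(mu, lwr, upr, chunk_size = 10000, ...) {
  # split the row indices into chunks of at most 'chunk_size', then rbind the pieces
  idx <- split(seq_along(mu), ceiling(seq_along(mu) / chunk_size))
  do.call(rbind, lapply(idx, function(i) guessers(mu[i], lwr[i], upr[i], ...)))
}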

Let's try the same thing, but increasing randks from 4 rows to 1000 and repeating the benchmark.

n <- 1000
# randks <- ...
nrow(randks)
# [1] 1000
microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr),
  times = 10
)
# Unit: seconds
#  expr  min   lq mean median   uq   max neval
#    g1 8.50 8.99 9.59   9.31 9.64 11.72    10
#    gs 3.35 3.44 3.61   3.63 3.77  3.93    10

So it's definitely faster. The median run-time for 1000 estimations is 3.63 seconds, so it processes just under 300 rows per second.

guessers memory: 24066176

(About 24 MB.) Actually, that doesn't seem bad at all. Decrease k.by to increase your accuracy, or increase k.by to speed this up.
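Finally, since what you're ultimately after is the standard deviation: for a gamma distribution the variance is shape * scale^2, so you can derive it straight from the guessed parameters, e.g.

# standard deviation of a gamma with the guessed shape (k) and scale
outs$sd <- outs$scale * sqrt(outs$k)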