I think this problem is ultimately an optimization problem, dealt with one row of data at a time. Since you want to scale it, though, here's an approximation for finding the core distribution parameters.
This process is not an optimization: it sweeps a defined range of possible k (shape) parameters and finds the shape/scale combination (given your mean) that most closely reproduces your upper and lower quantiles. You control the granularity of k, which is as close as you're going to get to having a tolerance (which would be appropriate for a true optimization).
As such, this process will be imperfect. I hope it gets you a fast-enough process with good-enough accuracy.
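The key fact the search leans on is that a gamma distribution's mean is shape times scale, so once the target mean is fixed, each candidate k pins down its own scale as mu / k. A minimal sketch of scoring one candidate, with made-up target numbers:
mu <- 20; lwr <- 9.5; upr <- 33.7   # hypothetical target mean and 2.5%/97.5% quantiles
k <- 8                              # one candidate shape
scl <- mu / k                       # scale pinned so that k * scl == mu
cand <- qgamma(c(0.025, 0.975), shape = k, scale = scl)
sqrt(sum((cand - c(lwr, upr))^2))   # distance score; smaller is better
The functions below just repeat that scoring over a fine grid of k values and keep the best candidate.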
I'm going to first demonstrate a process that operates one row at a time, as guesser1. I'll then expand it to do the same operation on an arbitrary number of mean, lower, and upper values.
Data with Known Answers
But first, I want to generate my own samples so that we have known "truth" for this guesser.
library(dplyr)
library(purrr)   # map2, map_dbl

set.seed(42)
n <- 4
randks <- tibble(
  k     = sample(1:10, size = n, replace = TRUE),
  scale = sample(seq(0.5, 2.5, by = 0.5), size = n, replace = TRUE)
) %>%
  mutate(
    samp     = map2(k, scale, ~ rgamma(1000, shape = .x, scale = .y)),
    theor_mu = k * scale,
    mu       = map_dbl(samp, ~ mean(.x)),
    lwr      = map_dbl(samp, ~ quantile(.x, 0.025)),
    upr      = map_dbl(samp, ~ quantile(.x, 0.975))
  ) %>%
  select(-samp)
randks
# # A tibble: 4 x 6
#       k scale theor_mu    mu   lwr   upr
#   <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
# 1    10     2       20  19.9  9.47  33.7
# 2    10   1.5       15  15.1  7.36  25.0
# 3     3     2        6  5.85  1.08  14.5
# 4     9   0.5      4.5  4.51  1.99  7.72
Guesser1
Single row at a time:
guesser1 <- function(mu, lwr, upr, k.max = 10, k.by = 0.01) {
  stopifnot(length(mu) == 1, length(lwr) == 1, length(upr) == 1)
  # candidate shapes; drop the leading 0, since shape must be positive
  ks <- seq(0, k.max, by = k.by)[-1]
  # for each candidate k, the scale is pinned at mu/k so the mean stays at mu
  L <- sapply(ks, function(k) qgamma(0.025, shape = k, scale = mu / k))
  U <- sapply(ks, function(k) qgamma(0.975, shape = k, scale = mu / k))
  # euclidean distance between each candidate's quantiles and the targets
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  ind <- which.min(dists)
  data.frame(
    k     = ks[ind],
    scale = mu / ks[ind],
    dist  = min(dists),
    lwr   = L[ind],
    upr   = U[ind]
  )
}
In action:
out1 <- do.call(rbind, Map(guesser1, randks$mu, randks$lwr, randks$upr))
cbind(subset(randks, select = -theor_mu), out1)
#    k scale    mu  lwr   upr     k scale  dist  lwr   upr
# 1 10   2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10   1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3  3   2.0  5.85 1.08 14.50  2.76 2.120 0.020 1.10 14.50
# 4  9   0.5  4.51 1.99  7.72  9.55 0.472 0.142 2.12  7.79
### \____ randks __________/ \____ guessed ____________/
There are certainly some differences, underscoring my original assertion that this is imperfect.
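If you'd rather quantify those differences than eyeball them, a quick check on the objects above:
data.frame(
  k_err     = out1$k - randks$k,         # error in the recovered shape
  scale_err = out1$scale - randks$scale  # error in the recovered scale
)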
Guessers
All rows at once. This is a little more work in the function, since it deals with matrices instead of just vectors. Not a problem, I just wanted to prove it one-at-a-time before going for the gusto.
guessers <- function(mu, lwr, upr, k.max = 10, k.by = 0.01, include.size = FALSE) {
  stopifnot(length(mu) == length(lwr), length(mu) == length(upr))
  # candidate shapes; drop the leading 0, since shape must be positive
  ks <- seq(0, k.max, by = k.by)[-1]
  # 'L' & 'U' dims: [mu, ks]; one row per requested mean, one column per candidate shape
  L <- outer(mu, ks, function(m, k) qgamma(0.025, shape = k, scale = m / k))
  U <- outer(mu, ks, function(m, k) qgamma(0.975, shape = k, scale = m / k))
  # 'dists' dims: [mu, ks]; lwr/upr are length-[mu] vectors, so row i is
  # compared against lwr[i]/upr[i] via recycling
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  # best candidate (column) for each row
  inds <- apply(dists, 1, which.min)
  mindists <- apply(dists, 1, min)
  i <- seq_along(mu)
  out <- data.frame(
    k     = ks[inds],
    scale = mu / ks[inds],
    dist  = mindists,
    lwr   = L[cbind(i, inds)],
    upr   = U[cbind(i, inds)]
  )
  if (include.size) {
    message("guessers memory: ",
            object.size(list(ks, L, U, dists, inds, mindists, i, out)))
  }
  out
}
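If the outer() calls look opaque, here's a tiny illustration (made-up means, a coarse grid of candidate shapes) of the [mu, ks] matrices they build:
mu_demo <- c(20, 15)
ks_demo <- c(1, 2, 3)
outer(mu_demo, ks_demo, function(m, k) qgamma(0.025, shape = k, scale = m / k))
# a 2x3 matrix: element [i, j] is the 2.5% quantile of a gamma with shape
# ks_demo[j] and scale mu_demo[i] / ks_demo[j]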
In action:
outs <- guessers(randks$mu, randks$lwr, randks$upr, include.size = TRUE)
# guessers memory: 106400
cbind(subset(randks, select = -theor_mu), outs)
#    k scale    mu  lwr   upr     k scale  dist  lwr   upr
# 1 10   2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10   1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3  3   2.0  5.85 1.08 14.50  2.76 2.120 0.020 1.10 14.50
# 4  9   0.5  4.51 1.99  7.72  9.55 0.472 0.142 2.12  7.79
### \____ randks __________/ \____ guessed (same) _____/
(I included a memory message in there just to track how much memory this consumes as it scales. It's not bad here, and that argument should definitely not be used in production. FYI.)
Comparison
Using microbenchmark, we repeat each operation a few times and compare their run times.
microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr)
)
# Unit: milliseconds
# expr  min   lq mean median   uq   max neval
#   g1 27.3 28.8 33.9   29.7 33.0 131.1   100
#   gs 13.3 13.6 14.4   13.8 14.6  20.3   100
Not too surprisingly, the all-at-once guessers is a bit faster. When will this not be the case? When the number of rows gets so big that memory consumption becomes a problem. I don't know where that threshold is.
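One untested way to cap memory, should you hit that point, is to chunk the rows and bind the pieces; guessers_chunked and its chunk size here are just a sketch, not something I've benchmarked:
guessers_chunked <- function(mu, lwr, upr, chunk = 500, ...) {
  # split row indices into blocks of at most 'chunk' rows
  idx <- split(seq_along(mu), ceiling(seq_along(mu) / chunk))
  # run guessers on each block and stack the results
  do.call(rbind, lapply(idx, function(i) guessers(mu[i], lwr[i], upr[i], ...)))
}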
Let's try the same thing, but increasing randks from 4 rows to 1000 and repeating the benchmark.
n <- 1000
# randks <- ...
nrow(randks)
# [1] 1000
microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr),
  times = 10
)
# Unit: seconds
# expr  min   lq mean median   uq   max neval
#   g1 8.50 8.99 9.59   9.31 9.64 11.72    10
#   gs 3.35 3.44 3.61   3.63 3.77  3.93    10
So it's definitely faster. The median run-time for 1000 estimations is 3.63 seconds, which works out to roughly 275 rows per second. For this size, the memory message reports
guessers memory: 24066176
(about 23 MiB). Actually, that doesn't seem bad at all. Decrease k.by to increase your accuracy, or increase k.by to speed this up.
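That figure is easy to sanity-check: with 1000 rows and the defaults k.max = 10 and k.by = 0.01 giving 1000 candidate shapes, the three 1000-by-1000 double matrices (L, U, and dists) account for nearly all of it:
3 * 1000 * 1000 * 8   # = 24,000,000 bytes, nearly all of the 24066176 reported
So memory grows roughly as nrow times length(ks) times 24 bytes; halving k.by doubles both the resolution of k and the memory footprint.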