1 vote

I have a List<Double> that holds probabilities (weights) for sampling an item. For example, the List holds 5 values as follows.

0.1, 0.4, 0.2, 0.1, 0.2

The i-th Double value is the probability of sampling the i-th item of another List<Object>.

How can I construct an algorithm to perform the sampling according to these probabilities?

I tried something like this, where I first converted the list of probabilities into cumulative form.

0.1, 0.5, 0.7, 0.8, 1.0

Then my approach is as follows: I generate a random double, iterate over the cumulative list to find the first value that is larger than the random double, and return its index.

Random r = new Random();
double p = r.nextDouble(); // uniform in [0, 1)
int total = list.size();
// linear scan over the cumulative probabilities
for (int i = 0; i < total; i++) {
    double d = list.get(i);
    if (d > p) {
        return i;
    }
}
return total - 1; // fallback in case of rounding, e.g. if the last value is 0.999...

This approach is slow because I am crawling through the list sequentially. In reality, my list has 800,000 items with associated weights (probabilities) that I need to sample from, so this sequential approach is far too slow.

I'm not sure how binary search can help. Let's say I generated p = 0.01. Then a binary search over the cumulative list would recurse as follows.

compare 0.01 to 0.7, repeat with L = 0.1, 0.5
compare 0.01 to 0.1, stop 
compare 0.01 to 0.5, stop

0.01 is smaller than 0.7, 0.5, and 0.1, but I obviously want only the index of 0.1 (index 0). So the stopping criterion for the binary search is still not clear to me.

If there's a library to help with this type of thing I'd also be interested.

Are you using the same weights multiple times? If so, a binary search would help, because you could transform your list of individual weights into a cumulative list. That's not going to help for generating a single value though. – Jon Skeet
@GáborBakos That won't quite work, but it's the right approach. You generate a random value, then do a binary search on the cumulative list, understanding that it might not be an exact match. – David Ehrmann

2 Answers

2 votes

This isn't the most memory-efficient approach, but use a NavigableMap whose keys are the cumulative probabilities shifted to start at 0.0, i.e. each key is the lower bound of its item's interval. Then you can just use floorEntry(random.nextDouble()). Like the binary search, it's O(log n) per lookup and O(n) memory.

So...

// keys are the lower bounds of each item's interval,
// built from the weights 0.1, 0.4, 0.2, 0.1, 0.2
NavigableMap<Double, Object> pdf = new TreeMap<>();
pdf.put(0.0, "foo");   // [0.0, 0.1) -> foo
pdf.put(0.1, "bar");   // [0.1, 0.5) -> bar
pdf.put(0.5, "baz");   // [0.5, 0.7) -> baz
pdf.put(0.7, "quz");   // [0.7, 0.8) -> quz
pdf.put(0.8, "quuz");  // [0.8, 1.0) -> quuz

Random random = new Random();

// nextDouble() is in [0, 1), so floorEntry always finds a key
Object sample = pdf.floorEntry(random.nextDouble()).getValue();
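
If the weights start out in a List<Double> (called weights below) alongside a matching List<Object> of items (called items), the map can be built once with a running total and then reused for every draw. The names weights, items, buildMap, and WeightedSampler are just placeholders for this sketch:

import java.util.List;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class WeightedSampler {

    // Build the map once: each item is keyed by the lower bound of its interval,
    // i.e. the sum of all weights that come before it.
    static NavigableMap<Double, Object> buildMap(List<Double> weights, List<Object> items) {
        NavigableMap<Double, Object> map = new TreeMap<>();
        double runningTotal = 0.0;
        for (int i = 0; i < weights.size(); i++) {
            map.put(runningTotal, items.get(i));
            runningTotal += weights.get(i);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Double> weights = List.of(0.1, 0.4, 0.2, 0.1, 0.2);
        List<Object> items = List.of("foo", "bar", "baz", "quz", "quuz");
        NavigableMap<Double, Object> pdf = buildMap(weights, items);

        Random random = new Random();
        System.out.println(pdf.floorEntry(random.nextDouble()).getValue());
    }
}

Building the map costs O(n log n) up front; every subsequent sample is a single O(log n) floorEntry call. (List.of requires Java 9 or newer; Arrays.asList works on older versions.)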
2 votes

Here is how you could do it using binary search, starting with the cumulative probabilities:

public static void main(String[] args) {
    double[] cdf = {0.1, 0.5, 0.7, 0.8, 1.0};
    double random = 0.75;  // generate randomly between zero and one
    // java.util.Arrays.binarySearch returns the match index if found,
    // otherwise -(insertionPoint) - 1
    int el = Arrays.binarySearch(cdf, random);
    if (el < 0) {
        // insertion point = index of the first cdf value greater than random
        el = -(el + 1);
    }
    System.out.println(el);  // prints 3 for random = 0.75
}

P.S. When the list of probabilities is short, a simple linear scan might turn out to be as efficient as binary search.
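
For the 800,000-weight case, the same idea can be packaged so the cumulative array is computed once from the weight list and reused for every draw. This is only a sketch; CdfSampler and sampleIndex are made-up names:

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class CdfSampler {

    private final double[] cdf;          // cumulative probabilities
    private final Random random = new Random();

    public CdfSampler(List<Double> weights) {
        cdf = new double[weights.size()];
        double runningTotal = 0.0;
        for (int i = 0; i < weights.size(); i++) {
            runningTotal += weights.get(i);
            cdf[i] = runningTotal;       // last entry ends up at ~1.0
        }
    }

    // Returns the index of the sampled item in O(log n).
    public int sampleIndex() {
        double p = random.nextDouble();
        int el = Arrays.binarySearch(cdf, p);
        if (el < 0) {
            el = -(el + 1);              // first cdf value greater than p
        }
        return Math.min(el, cdf.length - 1); // guard against rounding error
    }

    public static void main(String[] args) {
        CdfSampler sampler = new CdfSampler(List.of(0.1, 0.4, 0.2, 0.1, 0.2));
        System.out.println(sampler.sampleIndex());
    }
}

The returned index can then be used to pick the corresponding element out of the parallel List<Object>.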