1 vote

I have a List<Double> that holds probabilities (weights) for sampling an item. For example, the List holds 5 values as follows.

0.1, 0.4, 0.2, 0.1, 0.2

The i-th Double value is the probability of sampling the i-th item of another List<Object>.

How can I construct an algorithm to perform the sampling according to these probabilities?

I tried something like this, where I first converted the list of probabilities into cumulative form.

0.1, 0.5, 0.7, 0.8, 1.0

Then my approach is as follows: I generate a random double, iterate over the cumulative list to find the first value that is larger than the random double, and return its index.

Random r = new Random();
double p = r.nextDouble(); // uniform in [0, 1)
int total = list.size();
// linear scan over the cumulative probabilities
for (int i = 0; i < total; i++) {
    double d = list.get(i);
    if (d > p) {
        return i;
    }
}
return total - 1; // fallback in case of rounding, e.g. if the last value is 0.999...

This approach is slow because I am crawling through the list sequentially. In reality, my list has 800,000 items with associated weights (probabilities) that I need to sample from, so this sequential approach is far too slow.

I'm not sure how binary search can help. Let's say I generated p = 0.01. Then a binary search over the cumulative list would recurse as follows.

compare 0.01 to 0.7, repeat with L = 0.1, 0.5
compare 0.01 to 0.1, stop 
compare 0.01 to 0.5, stop

0.01 is smaller than 0.7, 0.5, and 0.1, but I obviously want only the index of 0.1 (index 0). So the stopping criterion for the binary search is still not clear to me.

If there's a library to help with this type of thing I'd also be interested.

Are you using the same weights multiple times? If so, a binary search would help, because you could transform your list of individual weights into a cumulative list. That's not going to help for generating a single value though. – Jon Skeet
@GáborBakos That won't quite work, but it's the right approach. You generate a random value, then do a binary search on the cumulative list, understanding that it might not be an exact match. – David Ehrmann

2 Answers

2 votes

This isn't the most memory-efficient approach, but use a NavigableMap whose keys are the cumulative probabilities shifted to start at 0.0, i.e. each key is the lower bound of its item's interval. Then you can just use floorEntry(random.nextDouble()). Like the binary search, it's O(log n) per lookup and O(n) memory.

So...

// keys are the lower bounds of each item's interval,
// built from the weights 0.1, 0.4, 0.2, 0.1, 0.2
NavigableMap<Double, Object> pdf = new TreeMap<>();
pdf.put(0.0, "foo");   // [0.0, 0.1) -> foo
pdf.put(0.1, "bar");   // [0.1, 0.5) -> bar
pdf.put(0.5, "baz");   // [0.5, 0.7) -> baz
pdf.put(0.7, "quz");   // [0.7, 0.8) -> quz
pdf.put(0.8, "quuz");  // [0.8, 1.0) -> quuz

Random random = new Random();

// nextDouble() is in [0, 1), so floorEntry always finds a key
Object sample = pdf.floorEntry(random.nextDouble()).getValue();
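
If the weights start out in a List<Double> (called weights below) alongside a matching List<Object> of items (called items), the map can be built once with a running total and then reused for every draw. The names weights, items, buildMap, and WeightedSampler are just placeholders for this sketch:

import java.util.List;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class WeightedSampler {

    // Build the map once: each item is keyed by the lower bound of its interval,
    // i.e. the sum of all weights that come before it.
    static NavigableMap<Double, Object> buildMap(List<Double> weights, List<Object> items) {
        NavigableMap<Double, Object> map = new TreeMap<>();
        double runningTotal = 0.0;
        for (int i = 0; i < weights.size(); i++) {
            map.put(runningTotal, items.get(i));
            runningTotal += weights.get(i);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Double> weights = List.of(0.1, 0.4, 0.2, 0.1, 0.2);
        List<Object> items = List.of("foo", "bar", "baz", "quz", "quuz");
        NavigableMap<Double, Object> pdf = buildMap(weights, items);

        Random random = new Random();
        System.out.println(pdf.floorEntry(random.nextDouble()).getValue());
    }
}

Building the map costs O(n log n) up front; every subsequent sample is a single O(log n) floorEntry call. (List.of requires Java 9 or newer; Arrays.asList works on older versions.)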
2 votes

Here is how you could do it using binary search, starting with the cumulative probabilities:

public static void main(String[] args) {
    double[] cdf = {0.1, 0.5, 0.7, 0.8, 1.0};
    double random = 0.75;  // generate randomly between zero and one
    // java.util.Arrays.binarySearch returns the match index if found,
    // otherwise -(insertionPoint) - 1
    int el = Arrays.binarySearch(cdf, random);
    if (el < 0) {
        // insertion point = index of the first cdf value greater than random
        el = -(el + 1);
    }
    System.out.println(el);  // prints 3 for random = 0.75
}

P.S. When the list of probabilities is short, a simple linear scan might turn out to be as efficient as binary search.
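
For the 800,000-weight case, the same idea can be packaged so the cumulative array is computed once from the weight list and reused for every draw. This is only a sketch; CdfSampler and sampleIndex are made-up names:

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class CdfSampler {

    private final double[] cdf;          // cumulative probabilities
    private final Random random = new Random();

    public CdfSampler(List<Double> weights) {
        cdf = new double[weights.size()];
        double runningTotal = 0.0;
        for (int i = 0; i < weights.size(); i++) {
            runningTotal += weights.get(i);
            cdf[i] = runningTotal;       // last entry ends up at ~1.0
        }
    }

    // Returns the index of the sampled item in O(log n).
    public int sampleIndex() {
        double p = random.nextDouble();
        int el = Arrays.binarySearch(cdf, p);
        if (el < 0) {
            el = -(el + 1);              // first cdf value greater than p
        }
        return Math.min(el, cdf.length - 1); // guard against rounding error
    }

    public static void main(String[] args) {
        CdfSampler sampler = new CdfSampler(List.of(0.1, 0.4, 0.2, 0.1, 0.2));
        System.out.println(sampler.sampleIndex());
    }
}

The returned index can then be used to pick the corresponding element out of the parallel List<Object>.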