Normally, a random number generator returns a stream of bits in which the probability of observing a 0 or a 1 at each position is equal (i.e. 50%). Let's call this an unbiased PRNG.
I need to generate a string of pseudo-random bits with the following property: the probability of seeing a 1 at each position is p (and hence the probability of seeing a 0 is 1-p). The parameter p is a real number between 0 and 1; in my problem it happens to have a resolution of 0.5%, i.e. it can take the values 0%, 0.5%, 1%, 1.5%, ..., 99.5%, 100%.
Note that p is a probability and not an exact fraction. The actual number of bits set to 1 in a stream of n bits must follow the binomial distribution B(n, p).
There is a naive method that can use an unbiased PRNG to generate the value of each bit (pseudocode):
generate_biased_stream(n, p):
    result = []
    for i in 1 to n:
        if random_uniform(0, 1) < p:
            result.append(1)
        else:
            result.append(0)
    return result
Such an implementation is much slower than one generating an unbiased stream, since it calls the random number generator once for each bit, whereas an unbiased stream generator needs only one call per machine word (e.g. it can produce 32 or 64 random bits with a single call).
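A direct C++ rendering of this pseudocode might look like the following (std::mt19937_64 is just a stand-in for the unbiased PRNG; this is not the exact code I benchmark later):

#include <cstdint>
#include <random>
#include <vector>

// Naive method: one full PRNG call per output bit.
std::vector<uint8_t> generate_biased_stream(size_t n, double p) {
    std::mt19937_64 rng{std::random_device{}()};
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<uint8_t> result;
    result.reserve(n);
    for (size_t i = 0; i < n; ++i)
        result.push_back(uniform(rng) < p ? 1 : 0);
    return result;
}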
I want a faster implementation, even if it sacrifices randomness slightly. An idea that comes to mind is to precompute a lookup table: for each of the 200 possible values of p, compute C 8-bit values using the slower algorithm and save them in a table. The fast algorithm would then just pick one of these at random to generate 8 skewed bits.
A back-of-the-envelope calculation of the memory needed: C should be at least 256 (the number of possible 8-bit values), and probably more to avoid sampling effects; let's say 1024. Maybe the number should vary depending on p, but let's keep it simple and say the average is 1024. Since there are 200 values of p, the total memory usage is 200 KB. This is not bad and might fit in the L2 cache (256 KB). I still need to evaluate it to see if there are sampling effects that introduce biases, in which case C will have to be increased.
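A rough sketch of the table-based fast path, with hypothetical names (the table itself would be filled offline by the slow per-bit algorithm):

#include <cstdint>
#include <random>

// Hypothetical lookup-table idea: for each of the 200 values of p,
// table[p_index] holds C = 1024 bytes whose bits were produced by the
// slow, exact method. The fast path just picks one of them at random.
constexpr int C = 1024;
extern uint8_t table[200][C];   // precomputed with the slow algorithm

uint8_t biased_byte(int p_index, std::mt19937_64& rng) {
    // 8 skewed bits per call; since C is a power of two, rng() % C is uniform.
    return table[p_index][rng() % C];
}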
A deficiency of this solution is that it can generate only 8 bits at once, and even that takes a fair amount of work, while an unbiased PRNG can generate 64 bits at once with just a few arithmetic instructions.
I would like to know if there is a faster method based on bit operations instead of lookup tables, for example modifying the random number generation code directly to introduce a bias into each bit. This would achieve the same performance as an unbiased PRNG.
Edit March 5
Thank you all for your suggestions; I got a lot of interesting ideas. Here are the top ones:
- Change the problem requirements so that p has a resolution of 1/256 instead of 1/200. This allows using bits more efficiently, and also gives more opportunities for optimization. I think I can make this change.
- Use arithmetic coding to efficiently consume bits from an unbiased generator. With the above change of resolution this becomes much easier.
- A few people pointed out that PRNGs are already very fast, so arithmetic coding might actually make the code slower due to the added overhead. Instead, I should always consume the worst-case number of bits and optimize that code. See the benchmarks below.
- @rici suggested using SIMD. This is a nice idea, which works only if we always consume a fixed number of bits.
Benchmarks (without arithmetic decoding)
Note: as many of you have suggested, I changed the resolution from 1/200 to 1/256.
I wrote several implementations of the naive method that simply takes 8 random unbiased bits and generates 1 biased bit (a sketch follows the list):
- Without SIMD
- With SIMD using Agner Fog's vectorclass library, as suggested by @rici
- With SIMD using intrinsics
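All of them are built around the same comparison: with p at a resolution of 1/256, one unbiased byte compared against a threshold of p*256 yields one biased bit. A rough sketch of the scalar core and of an AVX2-intrinsics core (the benchmarked code differs in details such as buffering and the p = 1 edge case):

#include <cstdint>
#include <immintrin.h>

// threshold = p * 256, so that P(random_byte < threshold) = threshold / 256 = p.
// Note: p = 1 (threshold = 256) does not fit in a byte and needs a special case.

// Scalar: one biased bit from one unbiased byte.
inline int biased_bit(uint8_t random_byte, uint8_t threshold) {
    return random_byte < threshold;
}

// AVX2 intrinsics: 32 biased bits from 32 unbiased bytes at once.
// AVX2 has no unsigned byte compare, so flip the sign bits and use the
// signed compare; movemask then collects one bit per byte.
inline uint32_t biased_bits32(__m256i random_bytes, uint8_t threshold) {
    const __m256i flip = _mm256_set1_epi8((char)0x80);
    __m256i b  = _mm256_xor_si256(random_bytes, flip);
    __m256i t  = _mm256_set1_epi8((char)(threshold ^ 0x80));
    __m256i lt = _mm256_cmpgt_epi8(t, b);          // 0xFF where byte < threshold
    return (uint32_t)_mm256_movemask_epi8(lt);     // one biased bit per byte
}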
I use two unbiased pseudo-random number generators:
- xorshift128plus (sketched after this list)
- Ranvec1 (Mersenne Twister-like) from Agner Fog's library.
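For reference, the scalar xorshift128+ step is roughly the following (this is the commonly published variant; the exact shift constants in my code may differ):

#include <cstdint>

// xorshift128+ as commonly published (shift constants 23, 17, 26).
// Returns 64 unbiased random bits per call; the state must not be all zero.
struct Xorshift128Plus {
    uint64_t s[2];
    uint64_t next() {
        uint64_t x = s[0];
        const uint64_t y = s[1];
        s[0] = y;
        x ^= x << 23;
        s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
        return s[1] + y;
    }
};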
I also measure the speed of the unbiased PRNG for comparison. Here are the results:
RNG: Ranvec1 (Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 16.081 16.125 16.093 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Throughput: 0.778 0.783 0.812 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 2.176 2.184 2.145 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 2.129 2.151 2.183 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
SIMD increases performance by a factor of 3 compared to the scalar method. The biased SIMD generator is about 8 times slower than the unbiased generator, as expected from the 1/8 efficiency.
The fastest biased generator achieves 2.1 Gb/s.
RNG: xorshift128plus
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Throughput: 18.300 21.486 21.483 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 22.660 22.661 24.662 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Throughput: 1.065 1.102 1.078 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 4.972 4.971 4.970 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 4.955 4.971 4.971 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
For xorshift, SIMD increases performance by a factor of 5 compared to the scalar method. The biased SIMD generator is about 4 times slower than the unbiased generator. Note that the underlying xorshift implementation here is scalar.
The fastest biased generator achieves 4.9 Gb/s.
RNG: xorshift128plus_avx2
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Throughput: 18.754 21.494 21.878 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 54.126 54.071 54.145 [Gb/s]
Number of ones: 536,874,540 536,880,718 536,891,316
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Throughput: 1.093 1.103 1.063 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 19.567 19.578 19.555 [Gb/s]
Number of ones: 104,836,115 104,846,215 104,835,129
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 19.551 19.589 19.557 [Gb/s]
Number of ones: 104,831,396 104,837,429 104,851,100
Theoretical : 104,857,600
This implementation uses AVX2 to run 4 unbiased xorshift generators in parallel (sketched below).
The fastest biased generator achieves 19.5 Gb/s.
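A sketch of the idea, assuming the same xorshift128+ constants as in the scalar sketch above: the four states live one per 64-bit lane and are advanced together.

#include <cstdint>
#include <immintrin.h>

// Four independent xorshift128+ states, one per 64-bit lane, advanced in
// lockstep. Returns 4 x 64 = 256 unbiased bits per call.
struct Xorshift128PlusAvx2 {
    __m256i s0, s1;   // lane i of (s0, s1) is the state of generator i

    __m256i next() {
        __m256i x = s0;
        __m256i y = s1;
        s0 = y;
        x  = _mm256_xor_si256(x, _mm256_slli_epi64(x, 23));
        s1 = _mm256_xor_si256(
                 _mm256_xor_si256(x, y),
                 _mm256_xor_si256(_mm256_srli_epi64(x, 17),
                                  _mm256_srli_epi64(y, 26)));
        return _mm256_add_epi64(s1, y);
    }
};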
Benchmarks for arithmetic decoding
Simple tests show that the arithmetic decoding code is the bottleneck, not the PRNG. So I am only benchmarking the most expensive PRNG.
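For context, my fixed-point variant is essentially a binary range decoder (LZMA-style) whose "compressed" input is the unbiased stream; the sketch below illustrates the idea rather than the exact benchmarked code.

#include <cstdint>
#include <random>

// Binary range decoder fed with unbiased bytes. prob is P(bit == 1) in units
// of 1/2048 (11-bit fixed point). On average it consumes only roughly H(p)
// unbiased bits per biased bit, at the cost of a data-dependent
// renormalization loop.
struct BiasedArithmeticDecoder {
    std::mt19937_64 rng;   // stand-in for the unbiased PRNG
    uint64_t buf = 0;      // unbiased bytes not yet consumed
    int buffered = 0;
    uint32_t code = 0;
    uint32_t range = 0xFFFFFFFFu;

    explicit BiasedArithmeticDecoder(uint64_t seed) : rng(seed) {
        for (int i = 0; i < 4; ++i) code = (code << 8) | next_byte();
    }

    uint32_t next_byte() {
        if (buffered == 0) { buf = rng(); buffered = 8; }
        uint32_t b = buf & 0xFF;
        buf >>= 8;
        --buffered;
        return b;
    }

    int decode_bit(uint32_t prob) {
        uint32_t bound = (range >> 11) * prob;
        int bit;
        if (code < bound) { bit = 1; range = bound; }
        else              { bit = 0; code -= bound; range -= bound; }
        // Renormalize: keep at least 24 significant bits in range,
        // pulling in fresh unbiased bytes as needed.
        while (range < (1u << 24)) {
            range <<= 8;
            code = (code << 8) | next_byte();
        }
        return bit;
    }
};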
RNG: Ranvec1 (Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Arithmetic decoding (floating point)
Throughput: 0.068 0.068 0.069 [Gb/s]
Number of ones: 10,235,580 10,235,580 10,235,580
Theoretical : 10,240,000
Method: Arithmetic decoding (fixed point)
Throughput: 0.263 0.263 0.263 [Gb/s]
Number of ones: 10,239,367 10,239,367 10,239,367
Theoretical : 10,240,000
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Throughput: 12.687 12.686 12.684 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Throughput: 14.536 14.536 14.536 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Throughput: 0.754 0.754 0.754 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Throughput: 2.094 2.095 2.094 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Throughput: 2.094 2.094 2.095 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
The simple fixed-point method achieves 0.25 Gb/s, while the naive scalar method is 3x faster and the naive SIMD method is 8x faster. There might be ways to optimize and/or parallelize the arithmetic decoding method further, but given its complexity I have decided to stop here and choose the naive SIMD implementation.
Thank you all for the help.