Tested on Python 3.7, NumPy 1.17.3:
It seems that random number generation with NumPy does not give consistent results when a fixed seed is combined with multithreading. The issue does not come up with SciPy. The following snippet shows the problem:
import numpy as np
from scipy.stats import nbinom
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_data_np():
    np.random.seed(0)
    return np.random.negative_binomial(5, 0.3, size=2)

def load_data_scipy():
    return nbinom.rvs(5, 0.3, size=2, random_state=0)
With a fixed seed, both functions should therefore always return the same numbers. But when the data is produced in a threaded loop...
with ThreadPoolExecutor() as executor:
    futures = list(
        (executor.submit(load_data_np)
         for i in range(1000))
    )
print(np.diff([future.result() for future in as_completed(futures)]))
one can find values such as these in the NumPy output:
...
[ 4]
[ -3]
[-15]
[ -3]
[ 5]
[ -6]
[ 0]
[ 6]
[ 1]
[-13]
[ -7]
[ 3]
[ 6]
[ -2]
[ -1]
[-11]
[ 3]
...
This presumably means that, somewhere between a thread's call to np.random.seed(0) and the completion of its draw of the 2 samples (size=2), another thread has reseeded or advanced the shared global state, which throws the other threads off in their position in the random stream. Just to compare this to SciPy:
from os import cpu_count  # cpu_count was not imported in the snippet above

with ThreadPoolExecutor(max_workers=cpu_count()) as executor:
    futures = list(
        (executor.submit(load_data_scipy)
         for i in range(1000))
    )
print(np.diff([future.result() for future in as_completed(futures)]))
yields the same value on every iteration:
...
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
[-11]
...
So what is the proper way to do thread-safe random number generation with a fixed seed in NumPy? Googling the issue has only led me back to np.random.seed.
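For reference, here is a minimal sketch of one pattern that seems to avoid the shared global state, assuming it is acceptable to build a private generator inside each task. The function name load_data_local is hypothetical and only used for illustration. Each call creates its own np.random.RandomState(0), so nothing another thread does can reseed or advance it between the seeding and the draw:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_data_local():
    # Hypothetical variant of load_data_np: a private RandomState per call,
    # so the module-level global state of np.random is never touched.
    rng = np.random.RandomState(0)
    return rng.negative_binomial(5, 0.3, size=2)


with ThreadPoolExecutor() as executor:
    futures = [executor.submit(load_data_local) for _ in range(1000)]
    results = np.array([future.result() for future in as_completed(futures)])

# Every task used an identically seeded private generator, so every row
# of `results` should be identical to the first one.
print(np.unique(results, axis=0))
```

NumPy 1.17 also offers np.random.default_rng(0), which could be used in the same way, although it is a different generator and would not reproduce the numbers of the legacy np.random.seed stream.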
Cheers, Michael