Use numpy to generate random data with case-when condition

Question

Question

I want to generate data to be modeled. Based on the following logic:

Input: 2 numpy array named z and d.

z : 1d array, value 0/1

d : 1d array, value 0/1

Return: y: 1d array. value: norm random numbers.

If z == 0 and d==0, y ~ norm(1,1),

if z == 0 and d == 1, y~ norm(0,1),

if z == 1 and d == 0, y ~ norm(1,1),

if z == 1 and d == 1, y ~ norm(2,1).

I want to do it in a super fast,clear and pythonic way.

It seems basic mathmatics and np.where is faster. In this case, I only have 3 conditions (you can see clearly from basic mathmatics part). If I have 10 or more condition, type them in if-else statement is sometimes confusing. I want to perform data simulations, which means I will generate data in million times at different n.So, what is the best way to do it?

What I have tried:

# generate data
n = 2000
z = np.random.binomial(1,0.5,n)
d = np.random.binomial(1,0.5,n)

dict case-when

def myfun(x):
    return {(0,1):np.random.normal(0,1),\
            (0,0):np.random.normal(1,1),\
            (1,0):np.random.normal(1,1),\
            (1,1):np.random.normal(2,1)}[x]
%%timeit
y = [myfun(i) for i in zip(z,d)]

Out:

16.2 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

simple `if-else`

%%timeit
y = np.random.normal([0 if (i == 0) & (j ==1) else 2 if (i == 1) & (j == 1) else 1 for i,j in zip(z,d)],1)

Out:

1.38 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

basic mathmatics

%%timeit
h0 = np.random.normal(0,1,n)
h1 = np.random.normal(1,1,n)
h2 = np.random.normal(2,1,n)
y = (1-z)*d*h0 + (1-d)*h1 + z*d*h2

Out:

140 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

np.where

%%timeit
h0 = np.random.normal(0,1,n)
h1 = np.random.normal(1,1,n)
h2 = np.random.normal(2,1,n)
y = np.where((d== 0),h1,0) + np.where((z ==1) & (d== 1),h2,0) + np.where((z ==0) & (d== 1),h0,0)

Out:

156 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Is there any other new method?

I'm certain your if-else doesn't do what you think it does so make sure all your options do the same thing first. — Andras Deak
Better, but you should also take a closer look at i == 0 & j ==1. — Andras Deak

Jurgen Strydom Jurgen Strydom · Accepted Answer · 2020-01-10T06:04:22

It seems you have already done the legwork here. The outcome is now based on trade-offs. All of the above solutions meet the criteria to a different extent.

Is this code to be used to teach other people? Maybe it is executed only once or twice a day? Its part of a larger project and needs to be extremely clear for someone else to maintain? If this is the case choose the slower but easier to understand and read options.

Is it executed thousands or millions of times a day? Resources cost money that reduces profit from a product? If so comment it super well and use the faster options.

It seems the basic mathematics option is the best tradeoff as it is simple, easy to understand yet quick to execute.

My biased review of each method:

dict case-when: slow, will require multiple readings / tests to fully understand what is actually happening and to figure out if there are any unknown gotchas.
simple if-else: slow, will require multiple readings / tests to fully understand what is actually happening and to figure out if there are any unknown gotchas.
basic mathematics: fast, easy to understand if you have even a minor mathematical background (which should include most programmers).
np.where: fast, gets the job done, , will require multiple readings / tests to fully understand what is actually happening but less prone to gotchas because its array based.

For reference here are the pythonic philosophy of writing code:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.

Using the above as criteria its much easier to evaluate whether your code is pythonic or not.

Use numpy to generate random data with case-when condition

Question

What I have tried:

dict case-when

simple if-else

basic mathmatics

np.where

2 Answers

simple `if-else`