2
votes

Question

I want to generate data to be modeled. Based on the following logic:

Input: 2 numpy array named z and d.

z : 1d array, value 0/1

d : 1d array, value 0/1

Return: y: 1d array. value: norm random numbers.

If z == 0 and d==0, y ~ norm(1,1),

if z == 0 and d == 1, y~ norm(0,1),

if z == 1 and d == 0, y ~ norm(1,1),

if z == 1 and d == 1, y ~ norm(2,1).

I want to do it in a super fast,clear and pythonic way.

It seems basic mathmatics and np.where is faster. In this case, I only have 3 conditions (you can see clearly from basic mathmatics part). If I have 10 or more condition, type them in if-else statement is sometimes confusing. I want to perform data simulations, which means I will generate data in million times at different n.So, what is the best way to do it?

What I have tried:

# generate data
n = 2000
z = np.random.binomial(1,0.5,n)
d = np.random.binomial(1,0.5,n)

dict case-when

def myfun(x):
    return {(0,1):np.random.normal(0,1),\
            (0,0):np.random.normal(1,1),\
            (1,0):np.random.normal(1,1),\
            (1,1):np.random.normal(2,1)}[x]
%%timeit
y = [myfun(i) for i in zip(z,d)]

Out:

16.2 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

simple if-else

%%timeit
y = np.random.normal([0 if (i == 0) & (j ==1) else 2 if (i == 1) & (j == 1) else 1 for i,j in zip(z,d)],1)

Out:

1.38 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

basic mathmatics

%%timeit
h0 = np.random.normal(0,1,n)
h1 = np.random.normal(1,1,n)
h2 = np.random.normal(2,1,n)
y = (1-z)*d*h0 + (1-d)*h1 + z*d*h2

Out:

140 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

np.where

%%timeit
h0 = np.random.normal(0,1,n)
h1 = np.random.normal(1,1,n)
h2 = np.random.normal(2,1,n)
y = np.where((d== 0),h1,0) + np.where((z ==1) & (d== 1),h2,0) + np.where((z ==0) & (d== 1),h0,0)

Out:

156 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Is there any other new method?

2
I'm certain your if-else doesn't do what you think it does so make sure all your options do the same thing first.Andras Deak
@AndrasDeak sorry, revised my postTravis
Better, but you should also take a closer look at i == 0 & j ==1.Andras Deak

2 Answers

1
votes

It seems you have already done the legwork here. The outcome is now based on trade-offs. All of the above solutions meet the criteria to a different extent.

Is this code to be used to teach other people? Maybe it is executed only once or twice a day? Its part of a larger project and needs to be extremely clear for someone else to maintain? If this is the case choose the slower but easier to understand and read options.

Is it executed thousands or millions of times a day? Resources cost money that reduces profit from a product? If so comment it super well and use the faster options.

It seems the basic mathematics option is the best tradeoff as it is simple, easy to understand yet quick to execute.

My biased review of each method:

  • dict case-when: slow, will require multiple readings / tests to fully understand what is actually happening and to figure out if there are any unknown gotchas.
  • simple if-else: slow, will require multiple readings / tests to fully understand what is actually happening and to figure out if there are any unknown gotchas.
  • basic mathematics: fast, easy to understand if you have even a minor mathematical background (which should include most programmers).
  • np.where: fast, gets the job done, , will require multiple readings / tests to fully understand what is actually happening but less prone to gotchas because its array based.

For reference here are the pythonic philosophy of writing code:

  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.

Using the above as criteria its much easier to evaluate whether your code is pythonic or not.

1
votes

I would think that the fastest option is to generate your random numbers only once by making use of array-valued parameters to normal. Using the new random API:

import numpy as np

rng = np.random.default_rng()

# generate data
n = 2000
z = rng.binomial(1, 0.5, n)
d = rng.binomial(1, 0.5, n)

def generate_once(z, d):
    """Generate randoms for https://stackguides.com/questions/59676147"""

    # encode mean; scale is always 1 anyway in the example
    means = np.zeros_like(z, dtype=float)
    z_inds = z == 0
    d_inds = d == 0
    means[d_inds] = 1
    means[z_inds & ~d_inds] = 2

    # generate the data
    y = rng.normal(means)
    return y

y = generate_once(z, d)

I didn't try to time this against all the others, but I'd expect this to be competitive. Consider it a faster variant of your if-else. Taking shortcuts in mapping out the means (and in general, also the scales) as an array can reduce overhead, and generating every normal number exactly once should shave off runtime.