0 votes

I'm currently trying to figure out the most efficient way to create a numpy array in a loop; here are the two examples:

import numpy as np
from time import time
tic = time()
my_list = range(1000000)
a = np.zeros((len(my_list),))  # preallocate, then fill in place
for i in my_list:
    a[i] = i
toc = time()
print(toc - tic)

vs

tic = time()
a = []
my_list = range(1000000)
for i in my_list:
    a.append(i)  # amortized O(1): the list over-allocates as it grows
a = np.array(a)  # one copy into a numpy array at the end
toc = time()

print(toc - tic)

I was expecting the second one to be much slower than the first because it needs new memory at each step of the loop, but the two are roughly the same, and I was wondering why. This is just out of curiosity, since I can do it either way.
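For comparison, when the values really are just 0..N-1 as here, a vectorized constructor avoids the Python-level loop entirely; a minimal sketch (timings will vary by machine):

tic = time()
a = np.arange(1000000, dtype=float)  # the loop runs in C, not in Python
toc = time()
print(toc - tic)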

What I actually want is to build a simple numpy array from data extracted from a dataframe, and my current code looks quite messy. I was wondering if there is a more pythonic way to do it. I have a dataframe and a list of labels that I need, and the simplest idea would be the following (the value I need is the last one of each column):

import pandas as pd

vars_outputs = ["x1", "x2", "ratio_x1_x2"]
my_df = pd.read_excel(path)
outpts = np.array(my_df[vars_outputs].iloc[-1])  # last row of each column

However, this is not possible, because some of the labels I want are not directly available in the dataframe: for example, ratio_x1_x2 needs to be computed from the first two columns. So I added a dict with the missing labels and the way to compute them (they are only ratios):

missing_labels = {"ratio_x1_x2": ["x1", "x2"]}

and then check the condition and build the numpy array (hence the previous question about efficiency):

outpts = []
for var in vars_outputs:
    if var in missing_labels:  # derived label: compute it from its two sources
        num, den = missing_labels[var]
        outpts.append(my_df[num].iloc[-1] / my_df[den].iloc[-1])
    else:
        outpts.append(my_df[var].iloc[-1])
outpts = np.array(outpts)

This seems way too complicated to me, but I cannot think of an easier way to do it (especially because I need this specific order in my numpy output array).

The other idea I have is to add columns to the dataframe with the operations I want, but since there are roughly 8000 labels, I don't know if that is the best approach, because I would have to look through all these labels after this preprocessing step.
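For what it's worth, if you did add the derived columns first, a short loop over the dict would keep the extraction simple; a minimal sketch, assuming every entry of missing_labels maps a derived name to its [numerator, denominator] pair as above:

for name, (num, den) in missing_labels.items():
    my_df[name] = my_df[num] / my_df[den]          # e.g. ratio_x1_x2
outpts = my_df[vars_outputs].iloc[-1].to_numpy()   # last row, in the requested order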

Thanks a lot

In previous questions the list-append and indexing methods have always been competitive. Sometimes the list append can be changed to a list comprehension. You could also test fromiter [sp?] and frompyfunc. – hpaulj

Can you include a complete example, i.e. with data and desired results? That will make it much easier to give you a complete and useful answer. – user2699

"because of the need of new memory at each step of the for loop" - it doesn't need new memory at every step. Lists resize by a multiplicative factor to avoid that. – user2357112

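To see the over-allocation user2357112 mentions, you can watch a list's reported byte size jump in steps rather than on every append; a quick sketch:

import sys
a = []
last = sys.getsizeof(a)
for i in range(32):
    a.append(i)
    size = sys.getsizeof(a)
    if size != last:  # capacity grew: a whole batch of slots was reserved
        print(len(a), size)
        last = size
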
2 Answers

1 vote

Here is the final code: np.fromiter() does the trick and, combined with a list comprehension, keeps the number of lines down.

df = pd.read_excel(path)
print(df.columns)

It outputs ['x1', 'x2']

vars_outputs = ["x1", "x2", "ratio_x1_x2"]
missing_labels = {"ratio_x1_x2" : ["x1", "x2"]}

it = [df[missing_labels[var][0]].iloc[-1] / df[missing_labels[var][1]].iloc[-1]
      if var in missing_labels
      else df[var].iloc[-1]
      for var in vars_outputs]

t = np.fromiter(it, dtype=float)

0 votes

Thanks @hpaulj, that might be very useful for me in the future. I wasn't aware of the speed-up from fromiter():

import timeit
setup = '''
import numpy as np
H, W = 400, 400
it = [(1 + 1 / (i + 0.5)) ** 2 for i in range(W) for j in range(H)]'''

fns = ['''
x = np.array([[(1 + 1 / (i + 0.5)) ** 2 for i in range(W)] for j in range(H)])
''', '''
x = np.fromiter(it, float)  # np.float was removed from numpy; use plain float
x = x.reshape(H, W)         # reshape returns the reshaped array; bind the result
''']
for f in fns:
    print(timeit.timeit(f, setup=setup, number=100))
# gives me
# 6.905218548999983
# 0.5763416080008028
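
Worth knowing: np.fromiter also takes an optional count argument, which lets it preallocate the whole output up front instead of growing it; a minimal sketch under the same setup:

x = np.fromiter(it, float, count=H * W)  # preallocates H*W slots up front
x = x.reshape(H, W)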

EDIT: PS, your for loop could be replaced by a comprehension like

it = [my_df[missing_labels[var][0]].iloc[-1]
      / my_df[missing_labels[var][1]].iloc[-1] if var in missing_labels
      else my_df[var].iloc[-1] for var in vars_outputs]
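which you could then feed straight to np.fromiter; a one-line sketch, assuming the values are floats:

outpts = np.fromiter(it, dtype=float)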