What are the pros and cons of using a structured array in Numpy instead of several arrays?

Question

I would like to use Numpy to gather objects with the same properties in a efficient way. I don't know what to choose between using one structured array or several arrays.

For example, let's consider an object Item and its properties id (4 byte unsigned integer), name (20 unicode characters), price (4 byte float).

Using a structured array:

import numpy as np
item_dtype = np.dtype([('id', 'u4'), ('name', 'U20'), ('price', 'f4')])

# Populate:
raw_items = [(12, 'Bike', 180.54), (33, 'Helmet', 46.99)]
my_items_a = np.array(raw_items, dtype=item_dtype)

# Access:
my_items_a[0] # first item
my_items_a['price'][1] # price of second item

Using several arrays, wrapped in a class for convience:

class Items:
    def __init__(self, raw_items):
       n = len(raw_items)

       id, name, price = zip(*raw_items)

       self.id = np.array(id, dtype='u4')
       self.name = np.array(name, dtype='U20')
       self.price = np.array(price, dtype='f4')

# Populate:
my_items_b = Items(raw_items)

# Access:
(my_items_b.id[0], my_items_b.name[0], my_items_b.price[0]) # first item
my_items_b.price[1] # price of second item

What are the pros and cons of these two approaches? When using one instead of the other? Thanks

Ami Tavory Ami Tavory · Accepted Answer · 2016-01-28T13:50:29

At least one consideration is that of locality of reference.

In general, it's a good idea to structure the memory layout so that when you access some memory location, there's a good chance you'll access locations right near by. This will increase cache performance.

Consequently, regardless of the logical meaning of the data:

If you have many operations where you will calculate something over all the fields of one record, then all the fields of the next record, etc., then you might consider records.
If you have many operations where you will calculate something over a single field for all entries, then something else for a different field for all entries, etc., then you might consider several arrays.

Asides from this, there's also the question of code clarity and ease of maintenance, so it's not a hard and fast rule. Also, in general, YMMV, so you should profile and instrument different options.

What are the pros and cons of using a structured array in Numpy instead of several arrays?

1 Answers