7 votes

NumPy seems to lack built-in support for 3-byte and 6-byte integer types, i.e. uint24 and uint48. I have a large data set that uses these types and I want to feed it to NumPy. This is what I currently do (for uint24):

import numpy as np
dt = np.dtype([('head', '<u2'), ('data', '<u2', (3,))])
# I would like to be able to write
#  dt = np.dtype([('head', '<u2'), ('data', '<u3', (2,))])
#  dt = np.dtype([('head', '<u2'), ('data', '<u6')])
a = np.memmap("filename", mode='r', dtype=dt)
# convert 3 x 2-byte words into 2 x 3-byte values
# w1 is the LSB word, w3 the MSB word
w1, w2, w3 = a['data'].swapaxes(0, 1)
a2 = np.empty((2, a.size), dtype='u4')
# the 3 least-significant bytes
a2[0] = w2 % 256
a2[0] <<= 16
a2[0] += w1
# the 3 most-significant bytes
a2[1] = w3
a2[1] <<= 8
a2[1] += w2 >> 8
# now a2 contains the "uint24" matrix

While this works for a 100 MB input, it looks inefficient (think of hundreds of GBs of data). Is there a more efficient way? For example, creating a special kind of read-only view that masks part of the data would be useful (a kind of "uint64 with the two most-significant bytes always zero" type). I only need read-only access to the data.


3 Answers

8 votes

I don't believe there's a way to do what you're asking (it would require unaligned access, which is highly inefficient on some architectures). My solution from Reading and storing arbitrary byte length integers from a file might be more efficient at transferring the data to an in-process array:

import numpy as np

a = np.memmap("filename", mode='r', dtype=np.dtype('>u1'))
e = np.zeros(a.size // 6, np.dtype('>u8'))
for i in range(3):
    # copy the three big-endian u2 words of each packed uint48 into the
    # low words of an 8-byte slot; the top word stays zero
    e.view(dtype='>u2')[i + 1::4] = a.view(dtype='>u2')[i::3]
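
For little-endian input, the same copy should work with the words mapped to the low end of each 8-byte slot; a sketch, assuming packed little-endian uint48 records:

import numpy as np

a = np.memmap("filename", mode='r', dtype=np.dtype('<u1'))
e = np.zeros(a.size // 6, np.dtype('<u8'))
for i in range(3):
    # words 0-2 of each little-endian u8 slot receive the three u2
    # words of the packed uint48; word 3 stays zero
    e.view(dtype='<u2')[i::4] = a.view(dtype='<u2')[i::3]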

You can get unaligned access using the strides constructor parameter:

# 'a' is the byte-level memmap from above; each element overlaps the next
e = np.ndarray((a.size - 2) // 6, np.dtype('<u8'), a, strides=(6,))

However, with this approach each element overlaps with the next, so to actually use it you'd have to mask out the two high bytes on access.
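
For example, a self-contained sketch of that masking, assuming little-endian data (the file name is illustrative):

import numpy as np

a = np.memmap("filename", mode='r', dtype='u1')
# each overlapping 8-byte read holds one uint48 in its low 6 bytes,
# plus 2 bytes that belong to the next element
e = np.ndarray((a.size - 2) // 6, np.dtype('<u8'), a, strides=(6,))
values = e[:1000] & 0xFFFFFFFFFFFF  # keep only the low 6 bytes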

2 votes

There's an answer for this over at: How do I create a Numpy dtype that includes 24 bit integers?

It's a bit ugly, but does exactly what you want: it lets you index your ndarray as if it had a dtype of <u3, so you can memmap() big data from disk.
You still need to manually apply a bitmask to clear the fourth, overlapping byte, but the mask can be applied to the sliced (multidimensional) array after access.

The trick is to abuse the strides of an ndarray so that each 3-byte element is read as an overlapping 4-byte integer. To keep NumPy from complaining that the last element runs past the end of the buffer, the view has to be trimmed by one element, as in the sketch below.
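
A minimal sketch of that stride trick for little-endian uint24 data (the file name is illustrative, and the final element is dropped so the last overlapping read stays inside the buffer):

import numpy as np

raw = np.memmap("filename", mode='r', dtype='u1')
# one overlapping little-endian uint32 per 3-byte value; the count is
# chosen so the final 4-byte read does not run past the end of the file
n = (raw.size - 1) // 3
u24 = np.ndarray(n, np.dtype('<u4'), raw, strides=(3,))
values = u24[:1000] & 0xFFFFFF  # clear the overlapping fourth byte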

0 votes

Using the code below you can read integers of any size encoded as big- or little-endian:

def readBigEndian(filename, bytesize):
    with open(filename, "rb") as f:
        chunk = f.read(bytesize)
        while len(chunk) == bytesize:
            value = 0
            for byte in chunk:  # iterating over bytes yields ints
                value = (value << 8) | byte
            yield value
            chunk = f.read(bytesize)

def readLittleEndian(filename, bytesize):
    with open(filename, "rb") as f:
        chunk = f.read(bytesize)
        while len(chunk) == bytesize:
            value = 0
            shift = 0
            for byte in chunk:
                value |= byte << shift
                shift += 8
            yield value
            chunk = f.read(bytesize)

for i in readLittleEndian("readint.py", 3):
    print(i)
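
If you want the result as a NumPy array rather than a plain generator, np.fromiter can collect it (a sketch reusing readLittleEndian from above; the file name is illustrative):

import numpy as np

values = np.fromiter(readLittleEndian("filename", 3), dtype=np.uint64)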