Variable/unknown length string/unicode dtype in numpy

Question

Is it possible to somehow load an array with a text field of unknown field length?

I figured out how to pass a dtype to get strings into the array. However, without specifying a length I just get U0 (or S0), a type which does not seem able to hold any data. E.g.:

>>> import io, numpy
>>> data = io.StringIO("test data lololol\ntest2 d4t4 ololol")
>>> ar = numpy.loadtxt(data, dtype=[("1",str), ("2",'S'), ("3",'S')])
>>> ar
array([('', b'', b''), ('', b'', b'')], 
      dtype=[('1', '<U0'), ('2', '|S0'), ('3', '|S0')])

When I switch to a dtype with a specified size, I get the input back:

>>> data.seek(0)
0
>>> numpy.loadtxt(data, dtype=[("1",(str,30)), ("2",(str,30)), ("3",('S',30))])
array([("b'test'", "b'data'", b'lololol'),
       ("b'test2'", "b'd4t4'", b'ololol')], 
      dtype=[('1', '<U30'), ('2', '<U30'), ('3', '|S30')])

I'd probably be fine with either S or U. In my case the field is supposed to hold a set of textual flags, something like Linux environment variables. Preallocating a large field just in case therefore seems like a big waste, especially when the number of rows goes into the millions.

I do understand, or at least have an idea, where such a design comes from: each row is built as a struct-like object that sits in one contiguous memory block. However, I thought there might be a way to make it keep something like a pointer in the case of strings.

Is it possible?

Not a task numpy is suited for. You could, for example, encode your strings (i.e. manually construct pointers) with, say, a hash function, and store them elsewhere. - alko
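As a rough illustration of the "store them elsewhere" idea in the comment, here is a minimal sketch that swaps the hash function for a plain lookup table built with np.unique. The flag strings are made up for the example and are not from the question. It only saves space when the same strings repeat across rows, and the table itself is still a fixed-width array.

```python
import numpy as np

# Hypothetical flag strings of very different lengths; not data from the question.
flags = ["PATH=/usr/bin", "DEBUG", "PATH=/usr/bin", "LANG=en_US.UTF-8;TERM=xterm", "DEBUG"]

# np.unique returns the sorted distinct strings plus, for every row,
# the index of that row's string in the table of distinct strings.
table, codes = np.unique(flags, return_inverse=True)

print(table.dtype)   # fixed-width unicode, sized for the longest distinct string
print(codes)         # one small integer per row
print(table[codes])  # decoding: index the table with the codes
```

The per-row array then holds only integers, and the long strings are stored once each in the table.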

hpaulj · Accepted Answer · 2013-12-18T06:57:44

Another question, "getting indices in numpy", uses np.recfromtxt, which can generate the dtype automatically. Effectively it calls np.genfromtxt with dtype=None.

Data like:

david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160

produces an array like:

array([('david', 'weight_2005', 50), ('david', 'weight_2012', 60),
       ('david', 'height_2005', 150), ('david', 'height_2012', 160),...], 
      dtype=[('f0', 'S5'), ('f1', 'S11'), ('f2', '<i4')])

The code in genfromtxt for determining the dtype looks complex. My guess is that it adjusts the Snn to accommodate the longest string that it encounters in that field.
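That guess can be checked with a small sketch. It reads the sample data from an in-memory io.StringIO rather than a file, and passes encoding="utf-8" explicitly (both are choices made for this example, not part of the original answer), so the text fields come back as unicode (U) rather than bytes (S).

```python
import io
import numpy as np

text = (
    "david weight_2005 50\n"
    "david weight_2012 60\n"
    "david height_2005 150\n"
    "david height_2012 160\n"
)

# dtype=None asks genfromtxt to infer a type for each column.
x = np.genfromtxt(io.StringIO(text), dtype=None, encoding="utf-8")

# Length of the longest string in the second column, computed by hand.
longest = max(len(line.split()[1]) for line in text.splitlines())

print(x.dtype)                # default field names are f0, f1, f2
print(x.dtype["f1"], longest) # inferred width of f1 next to the longest string
```

The inferred width of each text field matches the longest value in that column, so every row still pays for the longest string.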

One way to customize the dtype is to assign names in genfromtxt, and recast the values afterwards with astype.

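import numpy as np

# dtype=None lets genfromtxt infer the field types; names replace the default f0, f1, f2
x = np.genfromtxt('stack19944408.txt', dtype=None, names=['one', 'two', 'thr'])
# recast to the desired field widths; S10 truncates the 11-character strings
x.astype(dtype=[('one', 'S10'), ('two', 'S10'), ('thr', 'f')])
#array([('david', 'weight_200', 50.0), ('david', 'weight_201', 60.0),
#       ...
#      dtype=[('one', 'S10'), ('two', 'S10'), ('thr', '<f4')])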
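On the question's idea of keeping something like a pointer per string: one option that this answer does not cover is the object dtype, where each field stores a reference to an ordinary Python str. The following is a minimal sketch under that assumption, reusing the question's sample data and splitting the lines by hand instead of relying on loadtxt.

```python
import io
import numpy as np

data = io.StringIO("test data lololol\ntest2 d4t4 ololol")

# Split each line by hand; every field stays an ordinary Python str.
rows = [tuple(line.split()) for line in data]

# Object fields hold one reference per value, so no width has to be declared.
ar = np.array(rows, dtype=[("1", object), ("2", object), ("3", object)])

print(ar["3"])            # the third column as Python strings
print(ar.dtype.itemsize)  # bytes per row: three references, not the string data
```

The trade-off is that the strings live outside the array as separate Python objects, so the array is no longer one self-contained block and string operations run at Python speed. NumPy 2.0 and later also provide a variable-width string dtype, np.dtypes.StringDType, which may be worth a look if a recent version is available.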
