I want to create an h5py "string" dataset (for example "A") whose data type is "array of 8-bit integers (80)", as HDFView displays it. Each of the 80 integers is ord(x) of the corresponding character of the string, and the remainder is padded with zeros. For instance, Top is stored as 84 111 112 0 0 0 ..., 80 int8 values in total.
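
To make the encoding concrete, here is a minimal sketch of the padding step on its own, assuming ASCII-only names. The helper name encode_name and its width parameter are made up for illustration; they are not part of h5py or numpy.

```python
import numpy as np


def encode_name(name, width=80):
    """Hypothetical helper: return `name` as a zero-padded int8 vector."""
    # Assumption: ASCII only, so each encoded byte equals ord(character).
    raw = name.encode("ascii")
    if len(raw) > width:
        raise ValueError(f"{name!r} is longer than {width} characters")
    out = np.zeros(width, dtype=np.int8)
    out[:len(raw)] = np.frombuffer(raw, dtype=np.int8)
    return out


codes = encode_name("Top")
print(codes[:5])
```

Non-ASCII names would need a different encoding decision, which this sketch deliberately rejects with an error instead of guessing.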

The desired dataset should look like this

DATASET "NOM" {
                     DATATYPE  H5T_ARRAY { [80] H5T_STD_I8LE }
                     DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
                     DATA {
                     (0): [ 84, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
                     }
                  }

However, I'm unable to create this dataset with h5py. Writing a standard numpy array gives this instead:

DATASET "NOM" {
                     DATATYPE  H5T_STD_I8LE
                     DATASPACE  SIMPLE { ( 1, 80 ) / ( 1, 80 ) }
                     DATA {
                     (0,0): 84, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     (0,15): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     (0,31): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     (0,47): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     (0,63): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     (0,79): 0
                     }
                  }

So what data and dtype do I need to pass if my string is, say, "Top"?

.create_dataset("NOM", data=data, dtype=dtype)

According to https://github.com/h5py/h5py/issues/955, maybe I need to use a lower-level interface?

Thanks!

Solution

The problem is that if the numpy array is built first and then written with .create_dataset("NOM", data=data), numpy does not keep the 80int8 dtype on the array. It always turns it into an ordinary array of int8, so the array type is lost before h5py ever sees it:

import numpy as np

dtype = np.dtype("80int8")
x = np.array(2, dtype=dtype)
# x.dtype is now dtype('int8'), not the 80-element array type
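
A small check, as a sketch, of where the 80 goes: the dtype object itself still knows about the subarray, but any array created from it absorbs that shape into its own dimensions. Only numpy is needed here.

```python
import numpy as np

dt = np.dtype("80int8")
# The dtype object remembers its base type and subarray shape.
print(dt.base, dt.shape)

# An array built from it does not: the subarray shape is appended
# to the array's shape and the element dtype falls back to int8.
x = np.zeros(1, dtype=dt)
print(x.dtype, x.shape)
```

This is why passing such an array as data= ends up as a (1, 80) dataset of H5T_STD_I8LE.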

The solution is therefore to create the dataset with the desired dtype first and then fill in the data:

# gro is the h5py group (or file) being written to; nom is a list of strings of at most 80 characters
dataset = gro.create_dataset("NOM", (len(nom),), dtype="80int8")
for i in range(len(nom)):
    nom_80 = nom[i] + "\x00" * (80 - len(nom[i]))  # pad the name to 80 characters
    dataset[i] = [ord(x) for x in nom_80]
# dataset.dtype is dtype(('i1', (80,)))
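
For anyone who wants to try this without the surrounding program, here is a self-contained sketch of the same approach. The file name demo.h5, the sample names and the in-memory "core" driver are assumptions chosen for the demo; the create_dataset call is the one from above.

```python
import h5py
import numpy as np

names = ["Top", "Bottom"]  # hypothetical sample data

# Assumption for the demo: keep the file in memory so nothing is written to disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # Declare the array dtype up front, one element per name.
    ds = f.create_dataset("NOM", (len(names),), dtype="80int8")
    for i, name in enumerate(names):
        row = np.zeros(80, dtype=np.int8)         # zero padding
        row[:len(name)] = [ord(c) for c in name]  # character codes
        ds[i] = row
    # Copy what we want to inspect before the file is closed.
    stored_dtype = ds.dtype
    stored_shape = ds.shape
    first = ds[0]

print(stored_dtype, stored_shape)
```

To get a file that h5dump or HDFView can open, drop the driver and backing_store arguments so the file is written to disk.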

1 Answer

Make a uint8 array of the right size and content (this gives a plain 1-D dataset of 80 values, not an H5T_ARRAY):

In [417]: x = np.zeros(80, dtype='uint8')
In [419]: x[:3]=[ord(i) for i in 'Top']
In [421]: ds1=hf.create_dataset('other4', data=x)

A structured array approach:

In [486]: dt = np.dtype([('f0','80int8')])
In [487]: dt
Out[487]: dtype([('f0', 'i1', (80,))])
In [488]: x = np.zeros(1, dt)
In [489]: x['f0'][0][:3]=[ord(i) for i in 'Top']
In [490]: x
Out[490]: 
array([([ 84, 111, 112,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],)],
      dtype=[('f0', 'i1', (80,))])
In [491]: ds1=hf.create_dataset('st1', data=x)
In [492]: ds1
Out[492]: <HDF5 dataset "st1": shape (1,), type "|V80">

This produces a compound type with a single array field:

   DATASET "st1" {
      DATATYPE  H5T_COMPOUND {
         H5T_ARRAY { [80] H5T_STD_I8LE } "f0";
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): {
            [ 84, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
         }
      }
   }
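
Going the other way, a minimal sketch of turning such a field back into a Python string, assuming ASCII content and zero padding as in the question. It rebuilds the structured array in numpy only, so no HDF5 file is involved.

```python
import numpy as np

# Same structured dtype as in the session above.
dt = np.dtype([("f0", "80int8")])
x = np.zeros(1, dt)
x["f0"][0][:3] = [ord(c) for c in "Top"]

# Take the 80 codes of the first record, drop the zero padding, decode.
codes = x["f0"][0]
text = codes.tobytes().rstrip(b"\x00").decode("ascii")
print(text)  # prints "Top"
```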