I have a large 3D HDF5 dataset that represents location (X,Y) and time for a certain variable. Next, I have a 2D numpy array containing a classification for the same (X,Y) location. What I would like to achieve is that I can extract all time series from the 3D HDF5 dataset that fall under a certain class in the 2D array.
Here's my example:
import numpy as np
import h5py
# Open the HDF5 dataset
NDVI_file = 'NDVI_values.hdf5'
f_NDVI = h5py.File(NDVI_file,'r')
NDVI_data = f_NDVI["NDVI"]
# See what's in the dataset
NDVI_data
<HDF5 dataset "NDVI": shape (1319, 2063, 53), type "<f4">
# Let's make a random 1319 x 2063 classification containing class numbers 0-4
classification = np.random.randint(5, size=(1319, 2063))
Now we have our 3D HDF5 dataset and a 2D classification. Let's look for pixels that fall under class number '3':
# Look for the X,Y locations that have class number '3'
idx = np.where(classification == 3)
This returns a tuple of two arrays holding the X and Y coordinates that match the condition; in my random example there are 544433 such pairs. How should I now use this idx
variable to create a 2D array of shape (544433, 53) that contains the 544433 time series of the pixels with classification class number '3'?
I did some testing with fancy indexing on pure 3D numpy arrays, and this example works just fine:
subset = numpy_array_3d[idx[0], idx[1], :]
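For reference, here is a self-contained miniature of that in-memory case (tiny made-up shapes, purely illustrative):

```python
import numpy as np

# Tiny stand-in for the real data: shape (rows, cols, time)
arr = np.arange(4 * 3 * 2).reshape(4, 3, 2)
classification = np.array([[0, 3, 1],
                           [3, 3, 0],
                           [2, 1, 3],
                           [0, 0, 1]])

idx = np.where(classification == 3)
subset = arr[idx[0], idx[1], :]   # one time series per matching pixel

print(subset.shape)               # (4, 2): 4 matching pixels, 2 time steps
```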
However, the HDF5 dataset is too large to convert to a numpy array, and when I try to use the same indexing method directly on the HDF5 dataset:
# Try to use fancy indexing directly on HDF5 dataset
NDVI_subset = np.array(NDVI_data[idx[0],idx[1],:])
It throws me an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (C:\aroot\work\h5py\_objects.c:2584)
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (C:\aroot\work\h5py\_objects.c:2543)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\dataset.py", line 431, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 95, in select
sel[args]
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 429, in __getitem__
raise TypeError("Indexing elements must be in increasing order")
TypeError: Indexing elements must be in increasing order
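One workaround I have been sketching (untested at the real data scale, names as in the question): since h5py's point selection is stricter than numpy's, read one X-slab at a time and apply the boolean mask in plain numpy. The miniature below builds a throwaway in-memory file (`driver='core'`, `backing_store=False`) so nothing is written to disk:

```python
import numpy as np
import h5py

# Miniature stand-in for NDVI_values.hdf5, built entirely in memory
rng = np.random.default_rng(0)
rows, cols, n_time = 6, 5, 4
f = h5py.File('demo.hdf5', 'w', driver='core', backing_store=False)
dset = f.create_dataset('NDVI',
                        data=rng.random((rows, cols, n_time), dtype=np.float32))

classification = rng.integers(0, 5, size=(rows, cols))
mask = classification == 3

# Read one X-slab at a time; peak extra memory is one (cols, n_time) slab.
out = np.empty((int(mask.sum()), n_time), dtype=dset.dtype)
pos = 0
for x in np.unique(np.where(mask)[0]):   # only rows that contain class 3
    slab = dset[x, :, :]                 # plain 2D numpy array for this row
    sel = slab[mask[x]]                  # boolean-select the matching columns
    out[pos:pos + len(sel)] = sel
    pos += len(sel)

# Same values and same row-major ordering as full in-memory fancy indexing
assert np.array_equal(out, dset[...][np.where(mask)])
f.close()
```

The output rows come out in the same row-major order that np.where produces, so the result should be directly comparable to the pure-numpy subset.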
Another thing I tried was to np.repeat
the classification array along a third dimension to create a 3D array that matches the shape of the HDF5 dataset. The idx
variable then becomes a tuple of size 3:
classification_3D = np.repeat(np.reshape(classification,(1319,2063,1)),53,axis=2)
idx = np.where(classification_3D == 3)
But the following statement then throws the exact same error:
NDVI_subset = np.array(NDVI_data[idx])
Is this because an HDF5 dataset behaves differently from a pure numpy array? The documentation does say "Selection coordinates must be given in increasing order".
Does anyone have a suggestion on how I could get this to work without reading the full HDF5 dataset into memory (which does not fit)? Thank you very much!
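A second direction I have only sketched so far: h5py does accept a single increasing, duplicate-free index list on one axis, so all rows containing class 3 can be read in one call and then masked in numpy. Memory then scales with the number of distinct rows rather than the whole cube (again a toy in-memory file, not the real data):

```python
import numpy as np
import h5py

# In-memory stand-in again (no file written to disk)
rng = np.random.default_rng(1)
f = h5py.File('demo.hdf5', 'w', driver='core', backing_store=False)
dset = f.create_dataset('NDVI', data=rng.random((6, 5, 4), dtype=np.float32))

# Deterministic toy classification: every row holds exactly one '3'
classification = np.tile(np.arange(5), (6, 1))

mask = classification == 3
rows = np.unique(np.where(mask)[0])   # increasing + unique: valid for h5py
slabs = dset[rows, :, :]              # one h5py read: (len(rows), 5, 4)
subset = slabs[mask[rows]]            # 2D boolean mask over first two axes

assert subset.shape == (int(mask.sum()), 4)
f.close()
```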
What do the h5py
docs say about advanced or fancy indexing? I'd study that, and then set up a much smaller test case where I can test this sort of indexing on a 2d array before moving to 3d, and where I can print all values. It is certainly possible that H5 is more limited in indexing. – hpaulj