I have a large 3D HDF5 dataset that represents location (X,Y) and time for a certain variable. Next, I have a 2D numpy array containing a classification for the same (X,Y) location. What I would like to achieve is that I can extract all time series from the 3D HDF5 dataset that fall under a certain class in the 2D array.
Here's my example:
import numpy as np
import h5py
# Open the HDF5 dataset
NDVI_file = 'NDVI_values.hdf5'
f_NDVI = h5py.File(NDVI_file,'r')
NDVI_data = f_NDVI["NDVI"]
# See what's in the dataset
NDVI_data
<HDF5 dataset "NDVI": shape (1319, 2063, 53), type "<f4">
# Let's make a random 1319 x 2063 classification containing class numbers 0-4
classification = np.random.randint(5, size=(1319, 2063))
Now we have our 3D HDF5 dataset and a 2D classification. Let's look for pixels that fall under class number '3':
# Look for the X,Y locations that have class number '3'
idx = np.where(classification == 3)
This returns a tuple of two arrays holding the X and Y coordinates that match the condition; in my random example there are 544433 such pairs. How should I now use this idx
variable to create a 2D array of shape (544433, 53) that contains the 544433 time series of the pixels with classification class number '3'?
I did some testing with fancy indexing on pure 3D numpy arrays, and this example works just fine:
subset = numpy_array_3d[idx[0], idx[1], :]
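For reference, here is a self-contained miniature of that in-memory case (tiny made-up shapes, purely illustrative):

```python
import numpy as np

# Tiny stand-in for the real data: shape (rows, cols, time)
arr = np.arange(4 * 3 * 2).reshape(4, 3, 2)
classification = np.array([[0, 3, 1],
                           [3, 3, 0],
                           [2, 1, 3],
                           [0, 0, 1]])

idx = np.where(classification == 3)
subset = arr[idx[0], idx[1], :]   # one time series per matching pixel

print(subset.shape)               # (4, 2): 4 matching pixels, 2 time steps
```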
However, the HDF5 dataset is too large to convert to a numpy array, and when I try to use the same indexing method directly on the HDF5 dataset:
# Try to use fancy indexing directly on HDF5 dataset
NDVI_subset = np.array(NDVI_data[idx[0],idx[1],:])
It throws me an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (C:\aroot\work\h5py\_objects.c:2584)
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (C:\aroot\work\h5py\_objects.c:2543)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\dataset.py", line 431, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 95, in select
sel[args]
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 429, in __getitem__
raise TypeError("Indexing elements must be in increasing order")
TypeError: Indexing elements must be in increasing order
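One workaround I have been sketching (untested at the real data scale, names as in the question): since h5py's point selection is stricter than numpy's, read one X-slab at a time and apply the boolean mask in plain numpy. The miniature below builds a throwaway in-memory file (`driver='core'`, `backing_store=False`) so nothing is written to disk:

```python
import numpy as np
import h5py

# Miniature stand-in for NDVI_values.hdf5, built entirely in memory
rng = np.random.default_rng(0)
rows, cols, n_time = 6, 5, 4
f = h5py.File('demo.hdf5', 'w', driver='core', backing_store=False)
dset = f.create_dataset('NDVI',
                        data=rng.random((rows, cols, n_time), dtype=np.float32))

classification = rng.integers(0, 5, size=(rows, cols))
mask = classification == 3

# Read one X-slab at a time; peak extra memory is one (cols, n_time) slab.
out = np.empty((int(mask.sum()), n_time), dtype=dset.dtype)
pos = 0
for x in np.unique(np.where(mask)[0]):   # only rows that contain class 3
    slab = dset[x, :, :]                 # plain 2D numpy array for this row
    sel = slab[mask[x]]                  # boolean-select the matching columns
    out[pos:pos + len(sel)] = sel
    pos += len(sel)

# Same values and same row-major ordering as full in-memory fancy indexing
assert np.array_equal(out, dset[...][np.where(mask)])
f.close()
```

The output rows come out in the same row-major order that np.where produces, so the result should be directly comparable to the pure-numpy subset.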
Another thing I tried was to np.repeat
the classification array along a third dimension to create a 3D array that matches the shape of the HDF5 dataset. The idx
variable then becomes a tuple of size 3:
classification_3D = np.repeat(np.reshape(classification,(1319,2063,1)),53,axis=2)
idx = np.where(classification_3D == 3)
But the following statement then throws the exact same error:
NDVI_subset = np.array(NDVI_data[idx])
Is this because an HDF5 dataset behaves differently from a pure numpy array? The documentation does say "Selection coordinates must be given in increasing order".
Does anyone have a suggestion on how I could get this to work without reading the full HDF5 dataset into memory (which does not fit)? Thank you very much!
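A second direction I have only sketched so far: h5py does accept a single increasing, duplicate-free index list on one axis, so all rows containing class 3 can be read in one call and then masked in numpy. Memory then scales with the number of distinct rows rather than the whole cube (again a toy in-memory file, not the real data):

```python
import numpy as np
import h5py

# In-memory stand-in again (no file written to disk)
rng = np.random.default_rng(1)
f = h5py.File('demo.hdf5', 'w', driver='core', backing_store=False)
dset = f.create_dataset('NDVI', data=rng.random((6, 5, 4), dtype=np.float32))

# Deterministic toy classification: every row holds exactly one '3'
classification = np.tile(np.arange(5), (6, 1))

mask = classification == 3
rows = np.unique(np.where(mask)[0])   # increasing + unique: valid for h5py
slabs = dset[rows, :, :]              # one h5py read: (len(rows), 5, 4)
subset = slabs[mask[rows]]            # 2D boolean mask over first two axes

assert subset.shape == (int(mask.sum()), 4)
f.close()
```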
What do the h5py
docs say about advanced or fancy indexing? I'd study that, and then set up a much smaller test case where I can test this sort of indexing on a 2d array before moving to 3d, and where I can print all values. It is certainly possible that H5 is more limited in indexing. – hpaulj