
Is it possible to load partial chunks of a DataArray (stored as a single netCDF file) from disk into memory (i.e. not load the whole DataArray at once) without using dask-backed DataArrays?

The issue is that I'm already using dask as my cluster scheduler to submit jobs, and within those jobs I want to page a DataArray into memory from disk in small pieces. Unfortunately dask does not like nested schedulers, so opening the DataArray with da = xr.open_dataarray(file, chunks={'time': 1000}) doesn't work: it causes dask to throw nested daemonic process errors.
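For reference, my jobs are structured roughly like this; this is only a sketch, and the scheduler address, worker function name and file name are placeholders rather than my real code:

import xarray as xr
from dask.distributed import Client

def process_file(my_file):
    # opening with chunks inside a dask worker means a dask graph gets
    # scheduled from within an already-running dask task, which is what
    # fails with the nested daemonic process errors
    da = xr.open_dataarray(my_file, chunks={'time': 1000})
    return da.mean().compute()

client = Client('tcp://scheduler:8786')  # placeholder address
future = client.submit(process_file, 'my_file.nc')
result = future.result()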

Ideally, I'd like to do something like the following, loading only the relevant pieces into memory rather than the whole DataArray:

import xarray as xr

da = xr.open_dataarray(my_file)  # lazily open the file without reading the data
for t in range(0, len(da), 1000):
    da_actual = da[t:t+1000].load()  # materialize just this slice into memory
    # do some compute with da_actual

Any pointers or ideas on how to achieve this would be appreciated.


1 Answer


Wrapping this in dask.delayed might help:

import dask
import xarray as xr

@dask.delayed
def custom_array_func(my_file):
    da = xr.open_dataarray(my_file)  # lazily open the file
    final_result = None
    for t in range(0, len(da), 1000):
        da_actual = da[t:t+1000].load()  # materialize this slice into memory
        # do some compute with da_actual and accumulate it into final_result
    return final_result  # or return None if nothing is needed

[computed_results] = dask.compute([custom_array_func(my_file) for my_file in list_of_files])
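With this approach the chunked reading stays inside a single delayed task: the file is opened lazily without a chunks= argument, each slice is pulled into memory with .load(), and dask only ever sees one flat layer of tasks (one per file), so no scheduler gets nested inside another.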