I'm working with a set of 468 NetCDF files totaling 12 GB. Each file holds a single global snapshot of a geophysical variable, i.e. for each file the data shape is (1, 1801, 3600), corresponding to the dimensions ('time', 'latitude', 'longitude').
My RAM is 8 GB, so I need chunking. I'm creating an xarray dataset with xarray.open_mfdataset, and I have found that passing the chunks parameter when calling xarray.open_mfdataset or rechunking afterwards with the .chunk method has totally different outcomes. A similar issue was reported here without getting any response.
According to the xarray documentation, chunking when calling xarray.open_dataset and rechunking with .chunk should be exactly equivalent...
http://xarray.pydata.org/en/stable/dask.html
...but that doesn't seem to be the case. I share my examples here.
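A quick way to see what dask actually received is to inspect .chunks and .data.npartitions on the resulting variable. A minimal sketch on a small in-memory array (the variable name t2m matches my data; the shapes are scaled down, since the inspection calls are the same for file-backed data):

```python
import numpy as np
import xarray as xr

# Small in-memory stand-in for the real (468, 1801, 3600) data.
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"),
             np.random.rand(8, 20, 40))}
)

# Rechunk with .chunk and inspect what dask actually sees.
chunked = ds.chunk({"time": -1, "latitude": 10, "longitude": 20})
print(chunked.t2m.chunks)            # ((8,), (10, 10), (20, 20))
print(chunked.t2m.data.npartitions)  # 4
```

If .chunks on the file-backed dataset reports the expected block sizes but the computation still exhausts memory, the problem is in how the graph is executed rather than in the declared chunking.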
1) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude), LEAVING THE TIME DIMENSION UNCHUNKED.
import xarray as xr
from dask.diagnostics import ProgressBar

data1 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200}) \
          .chunk({'time': -1})
data1.t2m.data

with ProgressBar():
    data1.std('time').compute()
[########################################] | 100% Completed | 5min 44.1s
In this case everything works fine.
2) CHUNKING WITH THE .chunk METHOD ALONG THE SPATIAL DIMENSIONS (longitude, latitude), LEAVING THE TIME DIMENSION UNCHUNKED.
data2 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested') \
          .chunk({'time': -1, 'longitude': 400, 'latitude': 200})
data2.t2m.data
As this image shows, the chunking now looks exactly the same as in 1). However...
with ProgressBar():
    data2.std('time').compute()
[##################################### ] | 93% Completed | 1min 50.8s
...the computation of the std could not finish; the Jupyter notebook kernel died without any message because the memory limit was exceeded, as I could check by monitoring with htop... This suggests that the chunking was in fact not taking place and that the whole unchunked dataset was being loaded into memory.
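As a dask-free fallback for this particular reduction, the std along time can be computed by streaming over the files one at a time and accumulating a count, a running sum, and a running sum of squares, which bounds memory to a few 2-D fields. A sketch with small synthetic arrays standing in for the per-file snapshots (a real loop would open each NetCDF file in turn instead):

```python
import numpy as np

# Synthetic stand-ins for the per-file (lat, lon) snapshots;
# with real data each iteration would read one NetCDF file.
snapshots = [np.random.rand(5, 6) for _ in range(10)]

n = 0
s = np.zeros((5, 6))    # running sum over time
ss = np.zeros((5, 6))   # running sum of squares over time
for snap in snapshots:
    n += 1
    s += snap
    ss += snap ** 2

mean = s / n
# Population std along time (ddof=0, matching the defaults of
# numpy and xarray), without holding all snapshots at once.
std = np.sqrt(ss / n - mean ** 2)
```

Note that the sum-of-squares form can lose precision when the mean is large relative to the spread; Welford's online algorithm is the numerically safer variant of the same idea.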
3) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude), LEAVING THE TIME DIMENSION CHUNKED BY DEFAULT (ONE CHUNK PER FILE).
In theory this case should be much slower than 1), since the computation of std is done along the time dimension and thus many more chunks are generated unnecessarily (421420 chunks now vs 90 chunks in 1)).
data3 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200})
data3.t2m.data

with ProgressBar():
    data3.std('time').compute()
[########################################] | 100% Completed | 5min 51.2s
However, there are no memory problems, and the time required for the computation is almost the same as in case 1). This again suggests that the .chunk method is not working properly.
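For reference, the 90 chunks quoted for cases 1) and 3) follow directly from the grid and chunk sizes; a quick check in pure Python:

```python
import math

lat, lon = 1801, 3600    # grid size of one snapshot
clat, clon = 200, 400    # chunk sizes used above

# Number of spatial chunks when time is a single chunk.
spatial_chunks = math.ceil(lat / clat) * math.ceil(lon / clon)
print(spatial_chunks)    # 10 * 9 = 90
```

With one time chunk per file instead, this spatial count gets multiplied by the number of per-file time chunks, which is where the much larger total in case 3) comes from.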
Does anyone know whether this makes sense, or how to solve this issue? I need to be able to change the chunking depending on the specific computation I have to do.
Thanks
PS: I'm using xarray version 0.15.1.