
I'm working with a set of 468 NetCDF files totalling 12 GB. Each file contains a single global snapshot of a geophysical variable, i.e. for each file the data shape is (1, 1801, 3600), corresponding to the dimensions ('time', 'latitude', 'longitude').
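For reference, opening one file on its own shows this layout (a minimal sketch; the filename is hypothetical, the variable name t2m is the one used below):

import xarray as xr

# inspect a single file (hypothetical filename)
ds = xr.open_dataset('/data/cds_downloads/2m_temperature/snapshot_000.nc')
print(ds.t2m.dims)    # ('time', 'latitude', 'longitude')
print(ds.t2m.shape)   # (1, 1801, 3600)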

My RAM is 8 GB, so I need chunking. I'm creating an xarray dataset using xarray.open_mfdataset, and I have found that passing the chunks parameter when calling xarray.open_mfdataset and rechunking afterwards with the .chunk method have totally different outcomes. A similar issue was reported here without getting any response.

From the xarray documentation, chunking when calling xarray.open_dataset or rechunking afterwards with .chunk should be exactly equivalent...

http://xarray.pydata.org/en/stable/dask.html

...but it doesn't seem so. I share here my examples.
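To make the claimed equivalence concrete (a sketch with a hypothetical single file):

import xarray as xr

# according to the docs, these two should produce the same chunked dataset
ds_a = xr.open_dataset('snapshot.nc', chunks={'latitude': 200, 'longitude': 400})
ds_b = xr.open_dataset('snapshot.nc').chunk({'latitude': 200, 'longitude': 400})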

1) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude), LEAVING THE TIME DIMENSION UNCHUNKED.

import xarray as xr
from dask.diagnostics import ProgressBar

# chunk spatially at open time, then merge the per-file time chunks into one
data1 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200}) \
                         .chunk({'time': -1})
data1.t2m.data

[image: dask array repr of data1.t2m.data showing the chunk structure]

with ProgressBar():
    data1.std('time').compute()

[########################################] | 100% Completed |  5min 44.1s

In this case everything works fine.

2) CHUNKING WITH THE METHOD .chunk ALONG THE SPATIAL DIMENSIONS (longitude, latitude), LEAVING THE TIME DIMENSION UNCHUNKED.

# open without chunking at read time, then chunk everything with .chunk
data2 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested') \
                         .chunk({'time': -1, 'longitude': 400, 'latitude': 200})
data2.t2m.data

[image: dask array repr of data2.t2m.data, showing an apparently identical chunk structure]

As this image shows, the chunking now appears to be exactly the same as in 1). However...

with ProgressBar():
    data2.std('time').compute()

[#####################################   ] | 93% Completed |  1min 50.8s

...the computation of the std could not finish; the Jupyter notebook kernel died without any message because the memory limit was exceeded, as I could verify by monitoring with htop... This likely implies that the chunking was not actually taking place and the entire unchunked dataset was being loaded into memory.
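One way to inspect this is to compare the underlying dask task graphs, which differ even though the chunk layouts print identically (a sketch using dask's public collection interface):

# the chunk layouts look identical...
print(data1.t2m.data.chunks == data2.t2m.data.chunks)

# ...but the graphs are built differently: in data2 each load task reads a
# full (1, 1801, 3600) file before the rechunk layer splits it, while in
# data1 each load task reads only a (1, 200, 400) slice
print(len(data1.t2m.data.__dask_graph__()))
print(len(data2.t2m.data.__dask_graph__()))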

3) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude) AND LEAVING THE TIME DIMENSION CHUNKED BY DEFAULT (ONE CHUNK PER FILE).

In theory this case should be much slower than 1), since the computation of the std is done along the time dimension and thus many more chunks are generated unnecessarily (42120 chunks now, i.e. 90 spatial chunks × 468 time chunks, vs 90 chunks in 1)).
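A quick check of those chunk counts, using the shapes given at the top:

import math

lat_chunks = math.ceil(1801 / 200)      # 10
lon_chunks = math.ceil(3600 / 400)      # 9
print(lat_chunks * lon_chunks)          # 90 chunks when time is one chunk, as in 1)
print(lat_chunks * lon_chunks * 468)    # 42120 chunks with one time chunk per file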

# chunk spatially at open time; time stays chunked one-per-file by default
data3 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200})
data3.t2m.data

[image: dask array repr of data3.t2m.data showing the chunk structure]

with ProgressBar():
    data3.std('time').compute()

[########################################] | 100% Completed |  5min 51.2s

However, there are no memory problems, and the time required for the computation is almost the same as in case 1). This again suggests that the .chunk method is not working properly.

Does anyone know if this makes sense, or how to solve this issue? I would need to be able to change the chunking depending on the specific computation I need to do.

Thanks

PS: I'm using xarray version 0.15.1.

I have tested using xr.open_dataset with one single large NetCDF file (instead of a set of hundreds of smaller NetCDF files) and the same issue appears. The .chunk method does not seem to have a noticeable impact on performance. – susopeiz
I found something similar with chunking on open_mfdataset versus after. The first was slower, I think, but I couldn't really pin it down, plus I was just starting to learn xarray. My current approach is not to bother chunking during the open (as nothing is really read in yet?) and to chunk afterwards, especially when removing the time chunks with .chunk({'time': -1}). That seemed to have a big effect for some reason. Then I use .persist() or .compute(), which is when things really happen. – Paul
So, @Paul, you mean that in your case using .chunk after calling open_mfdataset has a clear impact on performance when computing? I would be quite surprised if that were the case. The working solution I have found is the opposite: simply forget about .chunk and create different dataset objects by calling open_mfdataset with different values passed to the chunks parameter. This really seems to work fine in my case. – susopeiz

1 Answer


"I would need to be able to change the chunking depending on the specific computation I need to do."

Yes, computations will be highly sensitive to chunk structure.

Chunking as early as possible in a computation (ideally when you're reading in the data) is best, because it makes the overall computation simpler.
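Concretely, rather than calling .chunk with different layouts on one dataset, you can re-open the files with chunks chosen for each computation, which is also the workaround reported in the comments above (a sketch reusing the question's path; the chunk values are illustrative):

import xarray as xr

# chunks chosen for a reduction along time, as in case 1) of the question
ds_time = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                            concat_dim='time', combine='nested',
                            chunks={'longitude': 400, 'latitude': 200}) \
                           .chunk({'time': -1})

# chunks chosen for, e.g., extracting time series at a few points
ds_space = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                             concat_dim='time', combine='nested',
                             chunks={'latitude': 100, 'longitude': 100})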

In general I recommend larger chunk sizes. See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
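For scale: the 468 files total about 12 GB, i.e. roughly 4 bytes per value, so the (468, 200, 400) chunks from case 1) already come out fairly large (rough arithmetic, assuming 4-byte floats in memory):

# rough chunk-size arithmetic for the chunks used in case 1)
elements = 468 * 200 * 400            # time x latitude x longitude
print(elements * 4 / 2**20)           # ~143 MiB per chunk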