13 votes

I am trying to work with data from very large netCDF files (~400 GB each). Each file has a few variables, all much larger than the system memory (e.g. 180 GB vs. 32 GB of RAM). I am trying to use numpy and netCDF4-python to do some operations on these variables by copying a slice at a time and operating on that slice. Unfortunately, it is taking a really long time just to read each slice, which is killing the performance.

For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:,:,0], so I do the following:

import netCDF4 as nc

f = nc.Dataset('myfile.ncdf','r+')
myvar = f.variables['myvar']
myslice = myvar[:,:,0]

But the last step takes a really long time (~5 min on my system). If, for example, I save a variable of shape (500, 500, 300) in the netCDF file, then a read operation of the same size takes only a few seconds.

Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices I am selecting come first. But for such a large file this is not possible to do in memory, and attempting it seems even slower given that a simple read already takes a long time. What I would like is a quick way to read a slice of a netCDF file, in the fashion of the Fortran interface's get_vara function, or some way of efficiently transposing the array.

2
If you want to do more with the data than just transposing it, have a look at the xarray module: it provides a very nice interface to dask out-of-memory arrays. – j08lue
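(For reference, a minimal sketch of that xarray/dask approach might look like the following; the dimension name 'z' is only a placeholder for whatever the 450-long dimension is called in the file.)

import xarray as xr

# open lazily, chunking along the hypothetical 'z' dimension
ds = xr.open_dataset('myfile.ncdf', chunks={'z': 1})
myslice = ds['myvar'].isel(z=0).compute()  # reads only the needed chunks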

2 Answers

8 votes

You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:

http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html

The idea is to "rechunk" the file by specifying the shapes of the chunks (multidimensional tiles) you want for the variables. You can specify how much memory to use as a buffer and how much to use for chunk caches, but it's not clear how to divide memory optimally between these uses, so you may just have to try some examples and time them. Rather than completely transposing a variable, you probably want to "partially transpose" it, by specifying chunks that hold a lot of data along the 2 big dimensions of your slice and only a few values along the other dimensions.
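For example, a partial transpose could look something like the command below. The dimension names x, y, z, t are made up here, since I don't know what your file calls them, and the chunk, buffer, and cache sizes are just starting points to experiment with:

# big chunks along the two dimensions you keep whole, single values
# along the dimensions you index over; adjust names and sizes to your file
nccopy -c x/500,y/500,z/1,t/1 -m 1G -h 4G myfile.ncdf rechunked.ncdf

After that, reading myvar[:,:,i] from the rechunked file only has to touch the chunks that actually contain that slice.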

3 votes

This is a comment, not an answer, but I can't comment on the above, sorry.

I understand that you want to process myvar[:,:,i], with i in range(450). In that case, you are going to do something like:

for i in range(450):
    myslice = myvar[:,:,i]
    do_something(myslice)

and the bottleneck is in accessing myslice = myvar[:,:,i]. Have you tried comparing how long it takes to access moreslices = myvar[:,:,0:n]? That would be contiguous data, and maybe you can save time with that. You would choose n as large as your memory affords, process that chunk, and then move on to the next chunk of data, moreslices = myvar[:,:,n:2*n], and so on.
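As a rough sketch of what I mean (do_something and the chunk size n are placeholders, and I'm assuming the 450-long dimension is the third one, as in your question):

import netCDF4 as nc

f = nc.Dataset('myfile.ncdf', 'r')
myvar = f.variables['myvar']

n = 10  # as large as your memory allows
for start in range(0, myvar.shape[2], n):
    # one contiguous read of n slices along the third dimension
    moreslices = myvar[:, :, start:start + n]
    for i in range(moreslices.shape[2]):
        do_something(moreslices[:, :, i])  # now operating on in-memory data

f.close()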