4 votes

I need to be able to quickly read lots of netCDF variables in Python (1 variable per file). I'm finding that the Dataset function in the netCDF4 library is rather slow compared to reading utilities in other languages (e.g., IDL).

My variables have a shape of (2600, 5200) and type float. They don't seem that big to me (file size = 52 MB).

Here is my code:

import numpy as np
from netCDF4 import Dataset
import time

file = '20151120-235839.netcdf'
t0 = time.time()
openFile = Dataset(file, 'r')
raw_data = openFile.variables['MergedReflectivityQCComposite']  # a netCDF4.Variable, not an array
data = np.copy(raw_data)  # forces the full read and an extra copy
openFile.close()
print(time.time() - t0)

It takes about 3 seconds to read one variable (one file). I think the main slowdown is np.copy: raw_data has type <type 'netCDF4.Variable'> rather than a numpy array, hence the copy. Is this the best/fastest way to do netCDF reads in Python?

Thanks.

The power of Numpy is that you can create views into the existing data in memory via the metadata it retains about the data. So a copy will always be slower than a view, via pointers. As @JCOidl says, it's not clear why you don't just use raw_data = openFile.variables['MergedReflectivityQCComposite'][:] – Eric Bridger
This simple step speeds up the read by an order of magnitude. Thank you! I'll try to leverage pointers with Numpy more. Do you know of a good reference explaining this concept a bit more (n00b here)? – weather guy
I'm not sure that it's faster in your case, but I would highly recommend using xarray - it handles gridded data at a higher level, and makes coding much nicer. It might also be faster, if you're dealing with large arrays. See stackoverflow.com/questions/47180126/… for a discussion of performance. – naught101

3 Answers

3 votes

The power of Numpy is that you can create views into the existing data in memory via the metadata it retains about the data. So a copy will always be slower than a view, via pointers. As JCOidl says, it's not clear why you don't just use:

raw_data = openFile.variables['MergedReflectivityQCComposite'][:]

For more info see the SciPy Cookbook and the SO question View onto a numpy array?
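
To make the view-versus-copy distinction concrete, here is a minimal NumPy-only sketch (no file I/O involved):

import numpy as np

a = np.arange(10)
view = a[2:5]         # basic slicing returns a view: it shares a's memory
copy = a[2:5].copy()  # .copy() allocates a new buffer

a[2] = 99
print(view[0])  # 99 -- the view sees the change to a
print(copy[0])  # 2  -- the copy is unaffected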

2 votes

I'm not sure what to say about the np.copy operation (which is indeed slow), but I find that the PyNIO module from UCAR works well for both NetCDF and HDF files. This will place data into a numpy array:

import Nio

file = '20151120-235839.netcdf'  # the file from the question
f = Nio.open_file(file, format="netcdf")
data = f.variables['MergedReflectivityQCComposite'][:]
f.close()

Testing your code versus the PyNIO code on a netCDF file I have resulted in 1.1 seconds for PyNIO, versus 3.1 seconds for the netCDF4 module. Your results may vary; worth a look though.
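
For reference, a minimal timing sketch along these lines (filename and variable name are taken from the question; PyNIO must be installed separately):

import time
from netCDF4 import Dataset
import Nio

fname = '20151120-235839.netcdf'
varname = 'MergedReflectivityQCComposite'

# netCDF4: slicing with [:] reads the variable straight into a numpy array
t0 = time.time()
nc = Dataset(fname, 'r')
d1 = nc.variables[varname][:]
nc.close()
print('netCDF4: %.2f s' % (time.time() - t0))

# PyNIO: the same access pattern
t0 = time.time()
f = Nio.open_file(fname, format='netcdf')
d2 = f.variables[varname][:]
f.close()
print('PyNIO:   %.2f s' % (time.time() - t0))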

1 vote

You can use xarray for that.

import xarray as xr

### Single netCDF file ###
ds = xr.open_dataset('path/file.nc')

### Opening multiple netCDF files and concatenating them by time ###
ds = xr.open_mfdataset('path/*.nc', concat_dim='time')

To read the variable you can simply type ds.MergedReflectivityQCComposite or ds['MergedReflectivityQCComposite'][:]

You can also use xr.load_dataset, but I find that it uses more memory than the open function. For xr.open_mfdataset, you can also chunk along the dimensions of the file if you want. There are other options for both functions, and you might be interested to learn more about them in the xarray documentation.
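
For example, a minimal sketch of pulling the question's variable into a plain numpy array (accessing .values triggers the actual read):

import xarray as xr

ds = xr.open_dataset('20151120-235839.netcdf')     # file from the question
data = ds['MergedReflectivityQCComposite'].values  # plain numpy array
ds.close()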