I have some large netCDF files that contain 6-hourly data for the Earth at 0.5 degree resolution.
There are 360 latitude points, 720 longitude points, and 1420 time points per year. I have both yearly files (12 GB each) and one file with 110 years of data (1.3 TB) stored as netCDF-4 (here is an example of the 1901 data, 1901.nc, its use policy, and the original, public files that I started with).
From what I understand, it should be faster to read from a single netCDF file than to loop over the set of files originally provided, which are separated by year and variable.
I want to extract a time series for each grid point, e.g. 10 or 30 years from a specific latitude and longitude. However, I am finding this to be very slow. As an example, it takes me 0.01 seconds to read in 10 values over time from a point location, while I can read in a spatial slice of 10,000 values from a single time point in 0.002 seconds (the order of the dimensions is lat, lon, time):
## a time series of 10 points from one location:
library(ncdf4)
met.nc <- nc_open("1901.nc")
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100, 100, 1),
                           count = c(1, 1, 10)))
user system elapsed
0.001 0.000 0.090
## close down session
## a global slice of 10k points from one time
library(ncdf4)
system.time(met.nc <- nc_open("1901.nc"))
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100, 100, 1),
                           count = c(100, 100, 1)))
user system elapsed
0.002 0.000 0.002
I suspect that these files have been written to optimize the reading of spatial layers, because a) the order of the dimensions is lat, lon, time, b) that would be the logical order for the climate models that generated these files, and c) global extents are the most common visualization.
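If that is the case, it should show up in the on-disk chunk layout. A quick check (a sketch; newer versions of ncdf4 expose the chunk sizes on the variable object, and the same information appears among the hidden attributes printed by ncdump -hs):
library(ncdf4)
met.nc <- nc_open("1901.nc")
## chunk sizes reported for lwdown, one entry per dimension;
## NA means the variable is stored contiguously rather than chunked
met.nc$var[["lwdown"]]$chunksizes
## shell equivalent: ncdump -hs 1901.nc  (look for the _ChunkSizes attribute)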
I have attempted to reorder variables so that time comes first:
ncpdq -a time,lon,lat 1901.nc 1901_time.nc
(ncpdq is from the NCO (netCDF operators) software)
> library(ncdf4)
## first with the original data set:
> system.time(met.nc <- nc_open("test/1901.nc"))
user system elapsed
0.024 0.045 22.334
> system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000)))
user system elapsed
0.005 0.027 14.958
## now with the rearranged dimensions:
> system.time(met_time.nc <- nc_open("test/1901_time.nc"))
user system elapsed
0.025 0.041 16.704
> system.time(a <- ncvar_get(met_time.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000)))
user system elapsed
0.001 0.019 9.660
How can I optimize reading a time series at a single point, rather than reading layers covering large areas at one time point? For example, would it be faster if the files were written differently, such as time, lat, lon? Is there an "easy" way to transform the order of dimensions in a netCDF-4 file?
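One alternative I have not benchmarked here would be to leave the dimension order alone and change only the chunking, so that each chunk holds a long run of time steps for a single grid cell. nccopy (part of the netCDF-C utilities, just as ncpdq is part of NCO) can rewrite a file with a specified chunk layout; the chunk sizes and output filename below are placeholders, not tuned values:
library(ncdf4)
## rechunk: one grid cell per chunk, one year of time steps per chunk
system("nccopy -c time/1420,lat/1,lon/1 1901.nc 1901_chunked.nc")
## then repeat the point time-series read against the rechunked file
met.chunked.nc <- nc_open("1901_chunked.nc")
system.time(a <- ncvar_get(met.chunked.nc, "lwdown", start = c(100, 100, 1),
                           count = c(1, 1, 1000)))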
Update
(benchmarks requested by @mdsumner)
library(rbenchmark)
library(ncdf4)
nc <- nc_open("1901.nc")
benchmark(timeseries = ncvar_get(nc, "lwdown",
                                 start = c(1, 1, 50),
                                 count = c(10, 10, 100)),
          spacechunk = ncvar_get(nc, "lwdown",
                                 start = c(1, 1, 50),
                                 count = c(100, 100, 1)),
          replications = 1000)
        test replications elapsed relative user.self sys.self user.child sys.child
2 spacechunk         1000   0.909    1.000     0.843    0.066          0         0
1 timeseries         1000   2.211    2.432     1.103    1.105          0         0
Update 2:
I have started developing a solution here. The bits and pieces are in a set of scripts at github.com/ebimodeling/model-drivers/tree/master/met/cruncep
The scripts still need some work and organization - not all of the scripts are useful. But the reads are lightning quick. Not exactly comparable to the results above, but at the end of the day, I can read a 100-year, six-hourly time series from the 1.3 TB file (0.5 degree resolution) in about 2.5 s, which is effectively instant:
system.time(ts <- ncvar_get(met.nc, "lwdown", start = c(50, 1, 1), count = c(160000, 1, 1)))
user system elapsed
0.004 0.000 0.004
(Note: the order of the dimensions has changed, as described here: How can I specify dimension order when using ncdf4::ncvar_get?)
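In case it helps anyone reproduce this, the core idea of the conversion is to rewrite the data with time listed first and chunked along time. A stripped-down sketch of that idea with ncdf4 (dimension lengths, units, missing value, and chunk sizes below are illustrative placeholders, not the exact values the scripts use):
library(ncdf4)
## placeholder dimension definitions for a single year at 0.5 degree resolution
lat  <- ncdim_def("lat",  "degrees_north", seq(-89.75,  89.75,  by = 0.5))
lon  <- ncdim_def("lon",  "degrees_east",  seq(-179.75, 179.75, by = 0.5))
time <- ncdim_def("time", "hours since 1901-01-01 00:00:00",
                  seq(0, by = 6, length.out = 1420))
## list time first, and chunk so each grid cell's year of values sits in one chunk
lwdown <- ncvar_def("lwdown", "W m-2", list(time, lat, lon), missval = -999,
                    prec = "double", chunksizes = c(1420, 1, 1))
out <- nc_create("1901_rechunked.nc", lwdown, force_v4 = TRUE)
## the real conversion copies the original data across in manageable slabs;
## a single small write just to show the call
ncvar_put(out, lwdown, array(0, c(1420, 1, 1)),
          start = c(1, 1, 1), count = c(1420, 1, 1))
nc_close(out)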