14
votes

I have some large netCDF files that contain 6 hourly data for the earth at 0.5 degree resolution.

There are 360 latitude points, 720 longitude points, and 1420 time points per year. I have both yearly files (12 GB ea) and one file with 110 years of data (1.3 TB) stored as netCDF-4 (here is an example of the 1901 data, 1901.nc, its use policy, and the original, public files that I started with).

From what I understood, it should be faster to read from one netCDF file rather than looping over the set of files separated by year and variable originally provided.

I want to extract a time series for each grid point, e.g. 10 or 30 years from a specific latitude and longitude. However, I am finding this to be very slow. As an example, it takes me 0.01 seconds to read in 10 values over time from a point location although I can read in a global slice of 10000 values from a single time point in 0.002 second (the order of the dimensions is lat, lon, time):

## a time series of 10 points from one location:
library(ncdf4)
met.nc <- nc_open("1901.nc")
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), 
                                             count = c(1,1,10)))
   user  system elapsed 
  0.001   0.000   0.090 

## close down session

## a global slice of 10k points from one time
library(ncdf4)
system.time(met.nc <- nc_open("1901.nc"))
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), 
                                             count = c(100,100,1)))
   user  system elapsed 
  0.002   0.000   0.002 

I suspect that these files have been written to optimize reading of spatial layers because a) the order of variables is lat, lon, time, b) that would be the logical order for the climate models that generated these files and c) because global extents are the most common visualization.

I have attempted to reorder variables so that time comes first:

ncpdq -a time,lon,lat 1901.nc 1901_time.nc

(ncpdq is from the NCO (netCDF operators) software)

> library(ncdf4)

## first with the original data set:
> system.time(met.nc <- nc_open("test/1901.nc"))
   user  system elapsed 
  0.024   0.045  22.334 
> system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000))
+ )
   user  system elapsed 
  0.005   0.027  14.958 

## now with the rearranged dimensions:
> system.time(met_time.nc <- nc_open("test/1901_time.nc"))
   user  system elapsed 
  0.025   0.041  16.704 
> system.time(a <- ncvar_get(met_time.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000)))
   user  system elapsed 
  0.001   0.019   9.660 

How can I optimize reading time series at a point rather than layers of large areas at one time point? For example, would it be faster if the files were written differently, such as time, lat, lon? Is there an "easy" way to transform the order of dimensions in a netCDF-4 file?

Update

(benchmarks requested by @mdsumner)

library(rbenchmark)
library(ncdf4)
nc <- nc_open("1901.nc")
benchmark(timeseries = ncvar_get(nc, "lwdown", 
                                 start = c(1, 1, 50), 
                                 count = c(10, 10, 100)), 
          spacechunk = ncvar_get(nc, "lwdown", 
                                  start = c(1, 1, 50), 
                                  count = c(100, 100, 1)),           
          replications = 1000)

        test replications elapsed relative user.self sys.self user.child
2 spacechunk         1000   0.909    1.000     0.843    0.066          0
1 timeseries         1000   2.211    2.432     1.103    1.105          0
  sys.child
2         0
1         0

Update 2:

I have started developing a solution here. The bits and pieces are in a set of scripts in github.com/ebimodeling/model-drivers/tree/master/met/cruncep

The scripts still need some work and organization - not all of the scripts are useful. But the reads are lightning quick. Not exactly comparable to the above results, but at the end of the day, I can read a 100 year, six-hourly time series from a 1.3TB file (0.5 degree resolution in 2.5s) instantly:

system.time(ts <- ncvar_get(met.nc, "lwdown", start = c(50, 1, 1), count = c(160000, 1, 1)))
   user  system elapsed 
  0.004   0.000   0.004 

(note: The order of dimensions have changed, as described here: How can I specify dimension order when using ncdf4::ncvar_get?)

3
count means that many cells from the start, so don't you mean count = c(1, 1, 10) for the first read?mdsumner
@mdsumner thanks for pointing this out. I have updated my question (with new, faster timings).David LeBauer

3 Answers

11
votes

I think the answer to this problem won't be so much re-ordering the data as it will be chunking the data. For a full discussion on the implications of chunking netCDF files, see the following blog posts from Russ Rew, lead netCDF developer at Unidata:

The upshot is that while employing different chunking strategies can achieve large increases in access speed, choosing the right strategy is non-trivial.

On the smaller sample dataset, sst.wkmean.1990-present.nc, I saw the following results when using your benchmark command:

1) Unchunked:

## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.841    1.000     0.812    0.029          0         0
## 1 timeseries         1000   1.325    1.576     0.944    0.381          0         0

2) Naively Chunked:

## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.788    1.000     0.788    0.000          0         0
## 1 timeseries         1000   0.814    1.033     0.814    0.001          0         0

The naive chunking was simply a shot in the dark; I used the nccopy utility thusly:

$ nccopy -c"lat/100,lon/100,time/100,nbnds/" sst.wkmean.1990-present.nc chunked.nc

The Unidata documentation for the nccopy utility can be found here.

I wish I could recommend a particular strategy for chunking your data set, but it is highly dependent on the data. Hopefully the articles linked above will give you some insight into how you might chunk your data to achieve the results you're looking for!

Update

The following blog post by Marcos Hermida shows how different chunking strategies influenced the speed when reading a time series for a particular netCDF file. This should only be used as perhaps a jumping off point.

In regards to rechunking via nccopy apparently hanging; the issue appears to be related to the default chunk cache size of 4MB. By increasing that to 4GB (or more), you can reduce the copy time from over 24 hours for a large file to under 11 minutes!

One point I'm not sure about; in the first link, the discussion is in regards to the chunk cache, but the argument passed to nccopy, -m, specifies the number of bytes in the copy buffer. The -m argument to nccopy controls the size of the chunk cache.

2
votes

EDIT: original question had a mistake, but there might also be different overheads for starting the read, so it's fair to do multiple reps. rbenchmark makes that easy.

The example file is a bit massive so I've used a smaller one, can you make the same comparison with your file?

More accessible example file: ftp://ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc

I get more like twice the time taken for a time series:

library(ncdf4)

nc <- nc_open("sst.wkmean.1990-present.nc")

library(rbenchmark)
benchmark(timeseries = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(10, 10, 100)), 
spacechunk = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(100, 100, 1)),           
replications = 1000)
##        test replications elapsed relative user.self sys.self user.child sys.child
##2 spacechunk         1000    0.47    1.000      0.43     0.03         NA        NA
##1 timeseries         1000    1.04    2.213      0.58     0.47         NA        NA
2
votes

Not sure if you have considered cdo to extract the point ?

cdo remapnn,lon=x/lat=y in.nc point.nc 

Sometimes CDO runs out of memory, if this happens, you might need to loop over the yearly files, and then cat the separate point files with

cdo mergetime point_${yyyy}.nc point_series.nc