
I have 45 years of data, named ds, in netCDF (.nc) format. It contains three coordinates: time, latitude, and longitude.

print(ds)

<xarray.Dataset>
Dimensions:    (latitude: 106, longitude: 193, time: 403248)
Coordinates:
  * latitude   (latitude) float32 -39.2 -39.149525 ... -33.950478 -33.9
  * longitude  (longitude) float32 140.8 140.84792 140.89584 ... 149.95209 150.0
  * time       (time) datetime64[ns] 1972-01-01 ... 2017-12-31T23:00:00
Data variables:
    FFDI       (time, latitude, longitude) float32 dask.array<shape=(403248, 106, 193), chunksize=(744, 106, 193)>
Attributes:
    creationTime:        1525925611
    creationTimeString:  Wed May  9 21:13:31 PDT 2018
    Conventions:         COARDS

I need to calculate the 95th percentile of FFDI by season, namely SON (Sep, Oct, Nov), DJF (Dec, Jan, Feb), MAM (Mar, Apr, May), and JJA (Jun, Jul, Aug).

import numpy as np

da_ffdi_95th = ds['FFDI'].reduce(np.percentile, dim='time', q=95)

This creates a new DataArray of 95th-percentile values, but the time dimension is dropped entirely, so there is no seasonal breakdown.
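
For illustration (a sketch based on the dataset printed above), inspecting the result shows the collapse:

print(da_ffdi_95th.dims)
# ('latitude', 'longitude'): time is gone, with no per-season values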

How can groupby be used with the np.percentile function?

Can you include an example DataSet? - Andy Hayden
Thanks. Unfortunately the nc file is over a few GBs. - alextc
an example, not the actual dataset... - Andy Hayden
Thanks. Can you please use the example dataset from xarray: ds = xr.tutorial.load_dataset('air_temperature')? This one gives two years of data (variable = air temperature). - alextc

1 Answer


Believe it or not, I think you're most of the way there! See DataArrayGroupBy.reduce for more details.

da_ffdi_95th = ds['FFDI'].groupby('time.season').reduce(
    np.percentile, dim='time', q=95)
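
If all goes well, the result gains a season dimension of length four; roughly, as a sketch of the expected output (not verified against the full dataset):

print(da_ffdi_95th.dims)
# ('season', 'latitude', 'longitude')
print(da_ffdi_95th['season'].values)
# ['DJF' 'JJA' 'MAM' 'SON']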

Since we are using a NumPy function, however, the data will be loaded eagerly. To make this dask-compatible, the function we pass to reduce must be able to operate on both NumPy and dask arrays. While dask implements a function for this purpose, dask.array.percentile, it operates only on 1-D arrays and is not a perfect match for the NumPy function.
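
To see the mismatch concretely, a minimal sketch (the shapes are made up; note that dask.array.percentile also takes no axis argument, and the 2-D call is left commented out because it raises):

import dask.array as dask_array

x1d = dask_array.random.random(100, chunks=10)
print(dask_array.percentile(x1d, [95]).compute())  # fine: 1-D input

x2d = dask_array.random.random((10, 10), chunks=5)
# dask_array.percentile(x2d, [95])  # raises: percentiles only implemented for 1-D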

Fortunately, with dask.array.map_blocks, it's easy enough to write our own. This uses the NumPy implementation of percentile and applies it to each chunk of the dask array; the only thing we need to be careful of is to make sure the array we apply it to is not chunked along the dimension we want to compute the percentile along.

import dask.array as dask_array
import numpy as np

def dask_percentile(arr, axis=0, q=95):
    # map_blocks applies np.percentile block-by-block, so the reduction
    # axis must live in a single chunk for the result to be correct.
    if len(arr.chunks[axis]) > 1:
        msg = ('Input array cannot be chunked along the percentile '
               'dimension.')
        raise ValueError(msg)
    # drop_axis tells dask that the output blocks lose the reduced axis.
    return dask_array.map_blocks(np.percentile, arr, axis=axis, q=q,
                                 drop_axis=axis)
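
A quick exercise of the chunking guard, with hypothetical shapes chosen purely for illustration:

x = dask_array.random.random((12, 4), chunks=(4, 4))  # 3 chunks along axis 0
# dask_percentile(x, axis=0, q=95)  # raises ValueError: chunked along axis 0
x_ok = x.rechunk({0: -1})  # merge axis 0 into a single chunk
print(dask_percentile(x_ok, axis=0, q=95).compute().shape)  # (4,)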

Then we can write a wrapper function that calls the appropriate percentile implementation depending on the type of the input array (either NumPy or dask):

def percentile(arr, axis=0, q=95):
    # Dispatch on array type: lazy dask implementation vs. eager NumPy.
    if isinstance(arr, dask_array.Array):
        return dask_percentile(arr, axis=axis, q=q)
    else:
        return np.percentile(arr, axis=axis, q=q)
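
The dispatch can be sanity-checked on both array types (again a small sketch with made-up shapes):

x_np = np.random.random((8, 4))
x_dask = dask_array.from_array(x_np, chunks=(8, 2))

print(percentile(x_np, axis=0, q=95).shape)              # (4,), eager NumPy result
print(percentile(x_dask, axis=0, q=95).compute().shape)  # (4,), lazy dask result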

Now, if we call reduce, making sure to add the allow_lazy=True argument, the operation returns a dask array (provided the underlying data is stored in a dask array and is appropriately chunked):

da_ffdi_95th = ds['FFDI'].groupby('time.season').reduce(
    percentile, dim='time', q=95, allow_lazy=True)
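
One caveat, following the chunking requirement above: the dataset in the question is chunked along time (chunksize=(744, 106, 193)), which would trip the ValueError in dask_percentile. A hedged sketch of rechunking first; the spatial chunk sizes here are arbitrary, chosen only to keep the single-time-chunk blocks from growing too large:

ds = ds.chunk({'time': -1, 'latitude': 27, 'longitude': 25})

da_ffdi_95th = ds['FFDI'].groupby('time.season').reduce(
    percentile, dim='time', q=95, allow_lazy=True)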