0
votes

I have multiple netCDF files that eventually i want to merge. An example netCDF is as follows.

import xarray as xr
import numpy as np
import cftime

Rain_nc = xr.open_dataset('filepath.nc', decode_times=False)
print(Rain_nc)

<xarray.Dataset>
Dimensions: (land: 67209, tstep:248)
Dimensions without coordinates: land, tstep
Data variables:
    lon    (land) float32...
    lat    (land) float32...
    timestp(tstep) int32...
    time   (tstep) int32...
    Rainf  (tstep, land) float32...

the dimension 'land' is a count of numbers 1 to 67209, and 'tstep' is a count from 1 to 248.

the variable 'lat' and 'lon' are latitude and longitude values with a shape of (67209,)

the variable 'time' is the time in seconds since the start of the month (netcdf is a month long)

Next ive swapped the dimensions from 'tstep' to 'time', converted it for later merging and set coords as lon, lat and time.

rain_nc = rain_nc.swap_dims({'tstep':'time'})
rain_nc = rain_nc.set_coords(['lon', 'lat', 'time'])

rain_nc['time'] = cftime.num2date(rain_nc['time'], units='seconds since 2016-01-01 00:00:00', calendar = 'standard')
rain_nc['time'] = cftime.date2num(rain_nc['time'], units='seconds since 1970-01-01 00:00:00', calendar = 'standard')

this has left me with the following Dataset:

print(rain_nc)

<xarray.Dataset>
Dimensions: (land: 67209, time: 248)
Coordinates:
    lon        (land)float32
    lat        (land)float32
  * time       (time)float64
Dimensions without coordinates: land
Data variables:
    timestp   (time)int32
    Rainf     (time, land)


print(rain_nc['land'])
<xarray.DataArray 'land' (land: 67209)>
array([    0,    1,    2,..., 67206, 67207, 67208])
Coordinates:
    lon     (land) float32 ...
    lat     (land) float32 ...
Dimensions without coordinates: land

the Rainf variable i am interested in is as follows:

<xarray.DataArray 'Rainf' (time: 248, land: 67209)>
[16667832 values with dtype=float32]
Coordinates:
    lon      (land) float32 -179.75 -179.75 -179.75 ... 179.75 179.75 
179.75
    lat      (land) float32 71.25 70.75 68.75 68.25 ... -16.25 -16.75 
-19.25
  * time     (time) float64 1.452e+09 1.452e+09 ... 1.454e+09 1.454e+09
Dimensions without coordinates: land
Attributes:
    title:       Rainf
    units:       kg/m2s
    long_name:   Mean rainfall rate over the \nprevious 3 hours
    actual_max:  0.008114143
    actual_min:  0.0
    Fill_value:  1e+20

From here i would like to create a netCDF with the dimensions (time, lat, lon) and the variable Rainf.

I have tried creating a new netCDF (or alter this one) but when i try to pass the Rainf variable does not work as it has a shape of (248, 67209) and needs a shape of (248, 67209, 67209). Even though the current 'land' dimension of 'Rainf' has a lat and lon coordinate. Is it possible to reshape this variable to have a time, lat, and lon dimension?

1

1 Answers

1
votes

In the end it seems that what you want is to reshape the "land" dimensions to the ("lat", "lon") ones.

So, you have some DataArray similar to this:

# Setting sizes and coordinates
lon_size, lat_size = 50, 80                                                                                                                                                                           
lon, lat = [arr.flatten() for arr in np.meshgrid(range(lon_size), range(lat_size))]                                                                                                                   
land_size = lon_size * lat_size                                                                                                                                                                       
time_size = 100 

da = xr.DataArray( 
    dims=("time", "land"), 
    data=np.random.randn(time_size, land_size), 
    coords=dict( 
        time=np.arange(time_size), 
        lon=("land", lon), 
        lat=("land", lat), 
    ) 
)  

which looks like this:

>>> da
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
  * time     (time) int64 0 1 ... 98 99
    lon      (land) int64 0 1 ... 48 49
    lat      (land) int64 0 0 ... 79 79
Dimensions without coordinates: land

First, we'll use the .set_index() method to tell xarray that the "land" index should be represented from the "lon" and "lat" coordinates:

>>> da.set_index(land=("lon", "lat"))                                                                                                                                                                    
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
  * time     (time) int64 0 1 ... 98 99
  * land     (land) MultiIndex
  - lon      (land) int64 0 1 ... 48 49
  - lat      (land) int64 0 0 ... 79 79

The dimensions are still ("time", "land"), but now "land" is a MultiIndex.

Note that if you try to write to NETCDF at this point you'll have the following error:

>>> da.set_index(land=("lon", "lat")).to_netcdf("data.nc")   
NotImplementedError: variable 'land' is a MultiIndex, which cannot yet be serialized to netCDF files (https://github.com/pydata/xarray/issues/1077). Use reset_index() to convert MultiIndex levels into coordinate variables instead.

It tells you to use the .reset_index() method. But that's not what you want here, because it will just go back to the original da state.

What you want from now is to use the .unstack() method:

>>> da.set_index(land=("lon", "lat")).unstack("land")                                                                                                                                                    
<xarray.DataArray (time: 100, lon: 50, lat: 80)>
array([[[...]]])
Coordinates:
  * time     (time) int64 0 1 ... 98 99
  * lon      (lon) int64 0 1 ... 48 49
  * lat      (lat) int64 0 1 ... 78 79

It effectively kills the "land" dimension and gives the desired output.