
I have a 4d dask array whose dimensions correspond to (time,depth,latitude,longitude). FYI, this is an oceanic data set.

# Create xarray dataset object for ocean temperature (time x depth x lat x lon)

DS=xr.open_mfdataset([outdir + s for s in flist],combine='by_coords',chunks={'xi_rho':25,'eta_rho':25})

DS.temp

Output:

xarray.DataArray
'temp' (ocean_time: 1456, s_rho: 50, eta_rho: 489, xi_rho: 655)
dask.array<chunksize=(1456, 50, 25, 25), meta=np.ndarray>
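
With those chunk sizes, temp is kept as a single block along time and depth but split into several hundred 25x25 tiles in space. A quick way to confirm the layout (just a convenience check, not part of the workflow):

# Inspect the dask chunking behind the DataArray
print(DS.temp.chunks)
print(DS.temp.data.numblocks)   # blocks per dimension; with 25x25 tiles this is roughly (1, 1, 20, 27)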

When I try to load a 1d-vector from this dask array, there are no issues.

T=DS.temp
%time
T.isel(ocean_time=0,eta_rho=100,xi_rho=500).values

Output (I am omitting the values below):

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.96 µs

I now want to select only those (lat,lon) where the ocean floor is deeper than, say, 1000 m.

depth_max = 1e3
deep = np.where(DS.h >= depth_max)  # DS.h is the depth of the ocean bottom.

# Number of locations where the ocean is deeper than depth_max
xy_num=len(deep[0])

This gives me a tuple 'deep' whose first element (deep[0]) is a list of all the 'eta_rho' (latitude index) values satisfying the condition. The second element of the tuple (deep[1]) is a list of the corresponding 'xi_rho' (longitude index) values. So, I can construct the (lat,lon) index pairs using (deep[0][0],deep[1][0]), (deep[0][1],deep[1][1]), (deep[0][2],deep[1][2]), (deep[0][3],deep[1][3]), etc.
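
Purely for illustration, the pairs can be listed explicitly like this (nothing downstream depends on it):

# Pair up the row (eta_rho) and column (xi_rho) indices returned by np.where
latlon_pairs = list(zip(deep[0], deep[1]))
print(latlon_pairs[:3])   # e.g. [(eta_0, xi_0), (eta_1, xi_1), (eta_2, xi_2)]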

This is great because I want to create a single coordinate that sweeps through the (lat,lon) pairs satisfying the above condition. The goal is to get from (time,depth,lat,lon) -> (time,depth,gthmax), where gthmax is the new coordinate described above. I do it like this:

# Picking only those (lat,lon) where the condition is satisfied
T_deep=T.isel(eta_rho=xr.DataArray(deep[0],dims='gthmax'),xi_rho=xr.DataArray(deep[1],dims='gthmax'))

# Explicitly assign the new coordinate
T_deep=T_deep.assign_coords({"gthmax":range(0,xy_num)})

# Create chunks along this new coordinate
T_deep=T_deep.chunk({'gthmax':1000})

T_deep

Output (just showing the dimensions):

xarray.DataArray 'temp' (ocean_time: 1456, s_rho: 50, gthmax: 133446)
dask.array<chunksize=(1456, 50, 1000), meta=np.ndarray>

Here is where the problem arises. When I try to access values from this new 3d dask array, even at a single point, the command below never completes and I end up having to kill the kernel. I have also tried using load() and compute(), to no avail.

T_deep.isel(ocean_time=0,s_rho=46,gthmax=100).values
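
In case it helps with diagnosis, here is how the chunk layout and the size of the task graph behind T_deep can be inspected (purely diagnostic, added for reference):

# Diagnostic only: how many chunks and tasks stand behind T_deep?
print(T_deep.data.numblocks)   # chunk count along (ocean_time, s_rho, gthmax)
print(len(T_deep.data.dask))   # total tasks dask must schedule before returning a value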

Any thoughts on where I am screwing up in the conversion from the original 4d dask array to the 3d dask array?

Thanks!


1 Answer


I don't have the dataset to test with, so it is hard to say from the limited information, but here is my first guess.

When you use xr.open_mfdataset to load the files, a new dimension "ocean_time" effectively gets attached to the topography variable h: its dimensions go from [eta_rho, xi_rho] to [ocean_time, eta_rho, xi_rho]. Therefore, when working on

depth_max = 1e3
deep = np.where(DS.h >= depth_max)

the indices in deep do not run over [eta_rho, xi_rho]; they actually run over [ocean_time, eta_rho, xi_rho]. My guess is that this is where the problem occurs. You are collapsing the coordinate system from [ocean_time, s_rho, eta_rho, xi_rho] to [ocean_time, s_rho, gthmax] by using

eta_rho=xr.DataArray(deep[0],dims='gthmax')  
xi_rho=xr.DataArray(deep[1],dims='gthmax')  

Notice that the length of the "ocean_time" dimension is 1456, which is far larger than your horizontal dimensions (eta_rho: 489, xi_rho: 655). I believe this confuses xarray/dask and slows things down once you try to pull out the values.
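
If that is the cause, one way to check and work around it is to reduce h to its two horizontal dimensions before calling np.where (a sketch based on my guess above, not tested against your files):

# Sketch, assuming h really did pick up an ocean_time dimension when the files were combined
h2d = DS.h.isel(ocean_time=0) if 'ocean_time' in DS.h.dims else DS.h
deep = np.where(h2d.values >= depth_max)   # deep[0]/deep[1] now index eta_rho/xi_rho only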


Since I now have your dataset, let me update my answer.

df = xr.open_mfdataset('./*.nc',  
                       combine='by_coords',  
                       chunks={'ocean_time':10}, 
                       parallel=True, 
)                                                                                                                 

temp = df[['temp']]
temp1 = temp.sel(xi_rho=xi_rho, eta_rho=eta_rho)
%time temp1.isel(ocean_time=0, s_rho=46, gthmax=100).compute()
CPU times: user 2.2 s, sys: 1.11 s, total: 3.31 s
Wall time: 3.05 s
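
My reading of why this is fast (an assumption on my part, not a separate benchmark): chunking along ocean_time keeps the full horizontal grid together in each chunk, so the pointwise selection only has to read a handful of chunks instead of hundreds of small 25x25 tiles. The two layouts can be compared like this:

# Sketch: compare chunk counts for the two chunking strategies (numbers depend on your files)
spatial = xr.open_mfdataset('./*.nc', combine='by_coords', chunks={'xi_rho': 25, 'eta_rho': 25})
bytime  = xr.open_mfdataset('./*.nc', combine='by_coords', chunks={'ocean_time': 10})
print(spatial.temp.data.numblocks)   # many blocks along eta_rho/xi_rho
print(bytime.temp.data.numblocks)    # one block in space, several along time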