
Long story short:

I initialize two lazy dask arrays and want to wrap them into an xarray DataArray. The dask arrays have different lengths, so I want to upsample the shorter one with NaNs, with the goal that both share the same xarray coordinate.

How can I do that in a computationally cheap way (without looping over each sample) while keeping dask's laziness?

Long story long:

Physically, the values of both dask arrays share the same time dimension (0 s to 5 s), but they have totally different sampling frequencies (2 MHz vs 3 kHz). So the lengths (= shapes along the time dimension) are very different.

Now I would love to enable the power of xarray by letting both dask arrays really share the same time coordinate.

The only way I can think of is to resample/upsample the shorter dask array with NaNs between each sample.

How can I achieve this? I am not sure whether xarray's resampling [1] or resampling on the dask level can help me here.

[1] http://xarray.pydata.org/en/stable/generated/xarray.Dataset.resample.html

For simplicity, let's stay in 1D with very short in-memory numpy arrays - in reality, the sources come from multiple huge hdf5 files:

import dask.array, xarray, numpy as np

long_source  = np.ones(11)
short_source = np.ones(3)
time = np.linspace(0, 5, len(long_source))

da_long  = dask.array.from_array(long_source)
da_short = dask.array.from_array(short_source)

# In best case, I find a way now to resample/fill da_short with NaNs
# between every sample to be able to stack both arrays!
# So an easy shortcut would be:

da_filler = dask.array.from_array(np.full(2, np.nan))
li_conc = [da_filler, da_short[0], da_filler, da_short[1], da_filler, da_short[2], da_filler]

da_short = dask.array.concatenate(li_conc)

Here - of course - comes the "ValueError: all the input arrays must have same number of dimensions", because each indexed element da_short[i] is a 0-d scalar and has no shape:

[dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>,
 dask.array<getitem, shape=(), dtype=float64, chunksize=()>,
 dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>,
 dask.array<getitem, shape=(), dtype=float64, chunksize=()>,
 dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>,
 dask.array<getitem, shape=(), dtype=float64, chunksize=()>,
 dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>]
# The rest of the pseudo code would be:
final_dask_array = dask.array.stack([da_long, da_short])

xr_data = xarray.DataArray(final_dask_array, coords={'time': time}, dims=['dataset', 'time'])

Apart from this manual concatenation surely being too slow for huge datasets, the approach above would only work when concatenating at least 2 samples at a time.
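As a side note, the ValueError itself can be avoided by slicing with i:i+1 instead of indexing with i, which keeps each piece 1-dimensional. This is only a sketch for small arrays - it still loops over each sample, which the question wants to avoid:

```python
import numpy as np
import dask.array

da_short = dask.array.from_array(np.ones(3))
da_filler = dask.array.from_array(np.full(2, np.nan))

# da_short[i:i+1] is a 1-d array of length 1, unlike the 0-d scalar da_short[i]
li_conc = []
for i in range(da_short.shape[0]):
    li_conc += [da_filler, da_short[i:i + 1]]
li_conc.append(da_filler)

da_upsampled = dask.array.concatenate(li_conc)  # shape (11,), still lazy
```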

So, after a final_dask_array.compute(), the wanted output should look like this:

[[ 1 ,  1 , 1,  1 ,  1 , 1,  1,   1 , 1,  1 ,  1 ],
 [nan, nan, 1, nan, nan, 1, nan, nan, 1, nan, nan]]
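For what it's worth, one cheap, loop-free way to build exactly the layout above on the dask level (a sketch, assuming a fixed number of NaNs between samples) is to pad each sample into a (n_samples, spacing) block and flatten it:

```python
import numpy as np
import dask.array

da_short = dask.array.from_array(np.ones(3))

# Build a (3, 3) block whose rows are [nan, nan, value],
# then flatten it row-major: nan, nan, 1, nan, nan, 1, nan, nan, 1
filler = dask.array.full((da_short.shape[0], 2), np.nan)
block = dask.array.concatenate([filler, da_short[:, None]], axis=1)
upsampled = block.reshape(-1)

# Append the two trailing NaNs to reach length 11
upsampled = dask.array.concatenate([upsampled, dask.array.full(2, np.nan)])
```

The result could then be stacked with the long array via dask.array.stack, everything staying lazy until .compute().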

How can I achieve this?

I really do hope I have described my problem in an understandable way. Thank you very much for your help; I would also be thankful for suggestions on how to improve my question.

1 Answer

Probably xarray.resample is what you want. Have a look at this code, which creates two xarray.DataArray objects and resamples them so that they can be compared:

import numpy as np
import pandas as pd
import xarray as xr

da1 = xr.DataArray(np.random.randint(0, 100, 11),
                   coords=[pd.date_range(start='14/09/2019 00:00:00',
                                         end='14/09/2019 00:00:05',
                                         periods=11)],
                   dims='time')

da2 = xr.DataArray(np.random.randint(0, 100, 3),
                   coords=[pd.date_range(start='14/09/2019 00:00:00',
                                         end='14/09/2019 00:00:05',
                                         periods=3)],
                   dims='time')

da1_resampled = da1.resample(time='500ms').asfreq()
da2_resampled = da2.resample(time='500ms').asfreq()

da1 looks like:

<xarray.DataArray (time: 11)>
array([29,  6, 75,  8, 17, 28, 90, 28, 88, 48, 81])
Coordinates:
  * time     (time) datetime64[ns] 2019-09-14 ... 2019-09-14T00:00:05

da2 looks like:

<xarray.DataArray (time: 3)>
array([ 8, 53, 18])
Coordinates:
  * time     (time) datetime64[ns] 2019-09-14 ... 2019-09-14T00:00:05

da1_resampled looks like:

<xarray.DataArray (time: 11)>
array([29.,  6., 75.,  8., 17., 28., 90., 28., 88., 48., 81.])
Coordinates:
  * time     (time) datetime64[ns] 2019-09-14 ... 2019-09-14T00:00:05

da2_resampled looks like:

<xarray.DataArray (time: 11)>
array([ 8., nan, nan, nan, nan, 53., nan, nan, nan, nan, 18.])
Coordinates:
  * time     (time) datetime64[ns] 2019-09-14 ... 2019-09-14T00:00:05

Both da1_resampled and da2_resampled have the same shape. You can continue to work with them as xarray objects or access their underlying data like so:

da1_resampled.data
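To get the single combined array the question asks for, the resampled DataArrays can also be stacked along a new dimension (a sketch using the same 500 ms grid as above; the 'dataset' dimension name is an assumption):

```python
import numpy as np
import pandas as pd
import xarray as xr

time_fine = pd.date_range('2019-09-14', periods=11, freq='500ms')
time_coarse = pd.date_range('2019-09-14', periods=3, freq='2500ms')

da1 = xr.DataArray(np.ones(11), coords=[time_fine], dims='time')
da2 = xr.DataArray(np.ones(3), coords=[time_coarse], dims='time')

# Upsample the short array onto the fine grid; missing samples become NaN
da2_resampled = da2.resample(time='500ms').asfreq()

# Concatenate along a new 'dataset' dimension; the time coordinates align exactly
combined = xr.concat([da1, da2_resampled],
                     dim=pd.Index(['long', 'short'], name='dataset'))
```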

Depending on how you want to further process your data, you could also interpolate the array instead of filling with NaNs:

da1_resampled = da1.resample(time='500ms').interpolate('linear')

or

da1_resampled = da1.resample(time='500ms').interpolate('nearest')