Long story short:
I initialize two lazy dask arrays and want to wrap them into an xarray DataArray. The dask arrays have different lengths, so I want to upsample the smaller one with NaNs, with the goal that both share the same xarray coordinate.
How can I do that in a computationally cheap way (without looping over each sample) while keeping dask's laziness?
Long story long:
Physically, the values of both dask arrays share the same time dimension (0 s to 5 s), but they have totally different sampling frequencies (2 MHz vs. 3 kHz). So the lengths (= shapes along the time dimension) are very different.
Now I would love to enable the power of xarray by letting both dask arrays really share the same time coordinate.
The only way I can think of is to resample/upsample the smaller dask array with NaNs between each sample.
How can I achieve this? I am not sure whether xarray's resampling [1] or resampling on the dask level can help me here.
[1] http://xarray.pydata.org/en/stable/generated/xarray.Dataset.resample.html
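For example, I wondered whether xarray's reindex could do the NaN-filling for me. A minimal sketch of the idea (my assumptions: the short array's sample times exactly coincide with some of the long time stamps, and reindex keeps the dask backing lazy - I have not verified the latter for huge arrays):

import numpy as np, dask.array, xarray

time_long = np.linspace(0, 5, 11)
time_short = np.linspace(0, 5, 3)  # [0.0, 2.5, 5.0] - all contained in time_long
short = xarray.DataArray(dask.array.ones(3, chunks=3),
                         coords={'time': time_short}, dims='time')
# reindex inserts NaN wherever the new time coordinate has no exact match
short_upsampled = short.reindex(time=time_long)

Since reindex aligns on exact (floating-point) coordinate matches, real measurement time stamps would probably need its method/tolerance arguments.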
For simplicity, let's stay in 1D and with very short in-memory numpy arrays - in reality, the source comes from multiple huge hdf5 files:
import dask.array, xarray, numpy as np
long_source = np.ones(11)
short_source = np.ones(3)
time = np.linspace(0, 5, len(long_source))
da_long = dask.array.from_array(long_source)
da_short = dask.array.from_array(short_source)
# In the best case, I now find a way to resample/fill da_short with NaNs
# between every sample, to be able to stack both arrays!
# So an easy shortcut would be:
da_filler = dask.array.from_array(np.full(2, np.nan))
li_conc = [da_filler, da_short[0], da_filler, da_short[1],
           da_filler, da_short[2], da_filler]
da_short = dask.array.concatenate(li_conc)
Here - of course - comes the "ValueError: all the input arrays must have same number of dimensions", since each da_short[i] is a 0-d scalar and has no shape:
[dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>,
dask.array<getitem, shape=(), dtype=float64, chunksize=()>,
dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>,
dask.array<getitem, shape=(), dtype=float64, chunksize=()>,
dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>,
dask.array<getitem, shape=(), dtype=float64, chunksize=()>,
dask.array<array, shape=(2,), dtype=float64, chunksize=(2,)>]
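As a side note, the concatenation itself would work if every piece stayed 1-d, e.g. with length-1 slices instead of scalar indexing - but that keeps the per-sample looping, so it is only shown for completeness:

# length-1 slices keep one dimension, so this concatenates fine,
# but it is still one list entry per sample and thus too slow:
li_conc = [da_filler, da_short[0:1], da_filler, da_short[1:2],
           da_filler, da_short[2:3], da_filler]
da_short_upsampled = dask.array.concatenate(li_conc)  # shape (11,)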
# The rest of the pseudo code would be:
final_dask_array = dask.array.stack([da_long, da_short_upsampled])
xr_data = xarray.DataArray(final_dask_array, coords={'time': time}, dims=['dataset', 'time'])
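The payoff would be that, once both rows share the time coordinate, label-based selection works across both datasets at once, e.g.:

# both datasets evaluated at the same physical time, lazily until .values
print(xr_data.sel(time=2.5).values)  # one value per dataset at t = 2.5 s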
Besides this manual concatenation surely being too slow for huge datasets, the above approach would also only work when concatenating pieces of at least 2 samples.
So the wanted output should look like this, after a final_dask_array.compute():
[[ 1 , 1 , 1, 1 , 1 , 1, 1, 1 , 1, 1 , 1 ],
[nan, nan, 1, nan, nan, 1, nan, nan, 1, nan, nan]]
How can I achieve this?
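The closest idea I have come up with is to pair every sample with a block of leading NaNs along a new axis and then flatten, which stays lazy and avoids the per-sample loop - a sketch, assuming an integer number of NaN fillers between the samples (here 2):

n_fillers = 2  # assumption: an integer number of NaNs before each sample
nan_blocks = dask.array.full((da_short.shape[0], n_fillers), np.nan)
# pair each sample with its leading NaNs, then flatten row by row
pairs = dask.array.concatenate([nan_blocks, da_short[:, None]], axis=1)
da_short_upsampled = pairs.ravel()  # nan, nan, 1, nan, nan, 1, nan, nan, 1
# append the trailing fillers to reach the full length of 11
da_short_upsampled = dask.array.concatenate([da_short_upsampled,
                                             dask.array.full(n_fillers, np.nan)])
final_dask_array = dask.array.stack([da_long, da_short_upsampled])
# final_dask_array.compute() then gives exactly the wanted output above

Is something like this the right direction, or is there a cleaner built-in way in dask or xarray?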
I really do hope I have described my problem in an understandable way. Thank you very much for your help, and I would also be thankful for suggestions on how to improve my question.