I'm working with sequential data of variable length. Let's consider data like
Y = [ [.01,.02], [.03,.04], [.05,.06], [.07,.08], [.09,.1] ]
l = [ 3, 2 ]
where Y is the result of some auxiliary calculation performed on my data and l stores the lengths of the original sequences. In this example, [.01,.02], [.03,.04], [.05,.06] is thus the result of the calculation performed on the first sequence of the batch (length 3), and [.07,.08], [.09,.1] is the result of the calculation performed on the second sequence of the batch (length 2).
Now I would like to do some further calculations on the entries of Y, but grouped by sequence. TensorFlow has functions such as tf.math.segment_sum that operate on a per-group basis.
Let's say I would like to sum using tf.math.segment_sum. I would be interested in
seq_ids = [ 0, 0, 0, 1, 1 ]
tf.math.segment_sum(Y, segment_ids=seq_ids) #returns [ [0.09 0.12], [0.16 0.18] ]
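To make the grouping concrete, here is a minimal NumPy sketch of the per-segment sum that the call above is expected to produce. It is only an illustration of the semantics, not TensorFlow's implementation; the names num_segments and segment_sums are made up for this example.

```python
import numpy as np

Y = np.array([[.01, .02], [.03, .04], [.05, .06], [.07, .08], [.09, .1]])
seq_ids = np.array([0, 0, 0, 1, 1])

# One output row per segment (hypothetical helper names).
num_segments = seq_ids.max() + 1
segment_sums = np.zeros((num_segments, Y.shape[1]))

# Unbuffered in-place add: each row of Y is accumulated into the
# output row selected by the matching entry of seq_ids.
np.add.at(segment_sums, seq_ids, Y)

print(segment_sums)
```

Rows 0 to 2 of Y end up in the first output row and rows 3 to 4 in the second, which matches the values in the comment above up to floating-point rounding.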
The problem I now face is how to get seq_ids from l.
In NumPy one could easily obtain this with
seq_ids = np.digitize( np.arange(np.sum(l)), np.cumsum(l) )
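As a sanity check of the NumPy approach, the sketch below runs the digitize expression on the example l and compares it with two other NumPy formulations that I believe are equivalent for positive lengths: np.repeat and np.searchsorted with side="right". The variable names are only illustrative.

```python
import numpy as np

l = np.array([3, 2])

positions = np.arange(np.sum(l))   # [0, 1, 2, 3, 4]
boundaries = np.cumsum(l)          # [3, 5]

# The expression from the question: index of the bin each position falls into.
via_digitize = np.digitize(positions, boundaries)

# Repeat each sequence index as many times as the sequence is long.
via_repeat = np.repeat(np.arange(len(l)), l)

# Same binning as digitize, expressed as a sorted search.
via_searchsorted = np.searchsorted(boundaries, positions, side="right")

print(via_digitize, via_repeat, via_searchsorted)
```

All three give [0 0 0 1 1] for this input. The repeat form needs no cumulative sum, which may make it the simplest one to port.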
It seems that there is an equivalent of digitize named bucketize that is hidden from the Python API, as mentioned in this search for a digitize in TensorFlow.
But it seems that the referenced hidden_ops.txt has been removed from TensorFlow, and it is unclear to me whether there still is (and will be) support for the function tensorflow::ops::Bucketize in the Python API.
Another idea I had to get a similar result was to use the tf.train.piecewise_constant function. But this attempt failed, as
seq_ids = tf.train.piecewise_constant(tf.range(tf.math.reduce_sum(l)), tf.math.cumsum(l), tf.range(BATCH_SIZE-1))
failed with the error: object of type 'Tensor' has no len().
It seems that tf.train.piecewise_constant isn't implemented in the most general way, as the parameters boundaries and values need to be lists instead of tensors. As l in my case is a 1-D tensor gathered in a minibatch of my tf.data.Dataset