2 votes

I'm working with sequential data of variable size. Let's consider data like

Y = [ [.01,.02], [.03,.04], [.05,.06], [.07,.08], [.09,.1] ]
l = [ 3, 2 ]

where Y is the result of some auxiliary calculation performed on my data and l stores the lengths of the original sequences. In this example, [.01,.02], [.03,.04], [.05,.06] is the result of a calculation performed on the first sequence of the batch and [.07,.08], [.09,.1] is the result of a calculation performed on the second sequence, the two sequences having lengths 3 and 2 respectively. Now I would like to do some further calculations on the entries of Y, but grouped by sequence. In TensorFlow there are functions such as tf.math.segment_sum which can be applied on a per-group basis.

Let's say I would like to sum using tf.math.segment_sum. I would be interested in

seq_ids = [ 0, 0, 0, 1, 1 ]
tf.math.segment_sum(Y, segment_ids=seq_ids) #returns [ [0.09 0.12], [0.16 0.18] ]

The problem I now face is getting seq_ids from l. In NumPy one could easily compute this with

seq_ids = np.digitize( np.arange(np.sum(l)), np.cumsum(l) )
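
For reference, running this on the example above (a minimal, self-contained NumPy sketch) reproduces the desired ids:

import numpy as np

l = [3, 2]
# bucket the indices 0..4 by the cumulative boundaries [3, 5]
seq_ids = np.digitize(np.arange(np.sum(l)), np.cumsum(l))
print(seq_ids)  # [0 0 0 1 1]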

It seems that there is an equivalent of digitize named bucketize that is hidden from the Python API, as mentioned in this search for a digitize in TensorFlow. But it seems that the referenced hidden_ops.txt has been removed from TensorFlow, and it is unclear to me whether there still is (and will be) support for the function tensorflow::ops::Bucketize in the Python API. Another idea I had to get a similar result was to use the tf.train.piecewise_constant function. But this attempt failed, as

seq_ids = tf.train.piecewise_constant(tf.range(tf.math.reduce_sum(l)), tf.math.cumsum(l), tf.range(BATCH_SIZE-1))

failed with object of type 'Tensor' has no len(). It seems that tf.train.piecewise_constant isn't implemented in the most general way, as the parameters boundaries and values need to be Python lists instead of tensors. Since l in my case is a 1-D tensor gathered from a minibatch of my tf.data.Dataset, this approach does not work for me.


2 Answers

2 votes

This is one way to do that:

import tensorflow as tf

def make_seq_ids(lens):
    # Get accumulated sums (e.g. [2, 3, 1] -> [2, 5, 6])
    c = tf.cumsum(lens)
    # Take all but the last accumulated sum value as indices
    idx = c[:-1]
    # Put ones on every index
    s = tf.scatter_nd(tf.expand_dims(idx, 1), tf.ones_like(idx), [c[-1]])
    # Use accumulated sums to generate ids for every segment
    return tf.cumsum(s)

with tf.Graph().as_default(), tf.Session() as sess:
    print(sess.run(make_seq_ids([2, 3, 1])))
    # [0 0 1 1 1 2]
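
To tie this back to the question, here is a quick sketch (reusing make_seq_ids from above and assuming the same TF 1.x graph mode) that feeds the generated ids into tf.math.segment_sum:

with tf.Graph().as_default(), tf.Session() as sess:
    Y = tf.constant([[.01, .02], [.03, .04], [.05, .06], [.07, .08], [.09, .1]])
    l = tf.constant([3, 2])
    # make_seq_ids(l) evaluates to [0, 0, 0, 1, 1]
    print(sess.run(tf.math.segment_sum(Y, segment_ids=make_seq_ids(l))))
    # approximately [[0.09 0.12]
    #                [0.16 0.18]]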

EDIT:

You can also implement the same thing using tf.searchsorted, in a way that is closer to what you proposed for NumPy:

import tensorflow as tf

def make_seq_ids(lens):
    c = tf.cumsum(lens)
    return tf.searchsorted(c, tf.range(c[-1]), side='right')
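
This version produces the same ids as before (a minimal sketch, again assuming TF 1.x graph mode):

with tf.Graph().as_default(), tf.Session() as sess:
    print(sess.run(make_seq_ids(tf.constant([2, 3, 1]))))
    # [0 0 1 1 1 2]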

Neither of these implementations should be a bottleneck in a TensorFlow model, so for just about any practical purpose it won't matter which one you choose. However, it is interesting to note that, on my particular machine (Win 10, TF 1.12, Core i7 7700K, Titan V), the second implementation is ~1.5x slower when running on CPU and ~3.5x faster when running on GPU.

0 votes

I think tf.ragged.row_splits_to_segment_ids is exactly what you are looking for. You just need tf.cumsum to convert the lengths into cumulative lengths (row splits) first. This gets you seq_ids from l. The code would look like

Y = [ [.01,.02], [.03,.04], [.05,.06], [.07,.08], [.09,.1] ]
l = [ 3, 2 ]
l_cum = tf.math.cumsum(l)                             # [3, 5]
l_cum = tf.concat([[0], l_cum], axis=0)               # row splits: [0, 3, 5]
seq_ids = tf.ragged.row_splits_to_segment_ids(l_cum)  # [0, 0, 0, 1, 1]
tf.math.segment_sum(Y, segment_ids=seq_ids)           # [[0.09, 0.12], [0.16, 0.18]]
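
Note that the concatenated l_cum, [0, 3, 5], is precisely the row-splits representation that tf.RaggedTensor uses to delimit rows, which is why tf.ragged.row_splits_to_segment_ids expects the leading zero.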