
I am developing an AI model with Tensorflow.js and Node.js. As part of this, I need to read and parse my large dataset in a streaming fashion (it's way too big to fit in memory all at once). This process ultimately results in a pair of generator functions (one for the input data and another for the output data) that iteratively yield Tensorflow.js Tensors:

function* example_parser() {
    while(thereIsData) {
        // do reading & parsing here....
        
        yield next_tensor;
    }
}

....which are wrapped in a pair of tf.data.generator()s, followed by a tf.data.zip().

This process can be fairly computationally intensive at times, so I would like to refactor it into a separate Node.js worker process / thread, as I'm aware that Node.js executes JavaScript in a single-threaded fashion.

However, I am also aware that if I were to transmit the data normally via e.g. process.send(), the serialisation / deserialisation would slow the process down so much that I'm better off keeping everything inside the same process.

To this end, my question is this:

How can I efficiently transmit (a stream of) Tensorflow.js Tensors between Node.js processes without incurring a heavy serialisation / deserialisation penalty?


1 Answer


How can I efficiently transmit (a stream of) Tensorflow.js Tensors between Node.js processes?

First, a tensor cannot be sent directly. A tensor object is only a handle: the values it refers to are held by the backend, not stored on the JavaScript object itself.

console.log(tensor) // will show info about the tensor but not the data it contains

Rather than transmitting the tensor object, its data can be sent:

// given a tensor t
// first get its data
const data = await t.data()
// and send it
worker.send({data})

In order to be able to reconstruct this tensor in the receiving process, the shape of the tensor needs to be sent as well:

worker.send({data, shape})
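One caveat with the message above: as far as I know, process.send() serialises messages as JSON by default, and JSON turns a typed array into a plain object with numeric keys. A minimal sketch of the difference follows, using a plain Float32Array as a stand-in for the result of `await t.data()`. The helper name toMessage is hypothetical, and v8.serialize() is used here only to imitate what child_process does when a process is forked with `{ serialization: 'advanced' }`.

```javascript
const v8 = require('node:v8');

// Hypothetical helper: bundle a tensor's values and shape into one message.
// `values` stands in for the typed array returned by `await t.data()`.
function toMessage(values, shape) {
  const expected = shape.reduce((a, b) => a * b, 1);
  if (values.length !== expected) {
    throw new Error(`shape [${shape}] needs ${expected} values, got ${values.length}`);
  }
  return { data: values, shape };
}

const msg = toMessage(new Float32Array([1, 2, 3, 4, 5, 6]), [2, 3]);

// JSON round trip (the default IPC serialisation): the typed array is lost.
const viaJson = JSON.parse(JSON.stringify(msg));

// v8 round trip (the basis of 'advanced' serialisation): the typed array survives.
const viaV8 = v8.deserialize(v8.serialize(msg));

console.log(viaJson.data instanceof Float32Array); // false
console.log(viaV8.data instanceof Float32Array);   // true

// On the receiving side the tensor would then be rebuilt with
// tf.tensor(viaV8.data, viaV8.shape), assuming tf is loaded there.
```

With the JSON default, the receiver would have to rebuild the typed array itself (for example via Float32Array.from(Object.values(data))), which adds yet another copy.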

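By default, sending and receiving messages between processes creates a copy of the initial data. If there is a lot of data to send and the copy would incur a noticeable penalty, the copy can be avoided by using worker threads (the worker_threads module) instead of separate processes. There are two options. The first is to transfer the underlying ArrayBuffer by listing it in the transferList argument of postMessage(); this is zero-copy, but once the data is sent it can no longer be used by the sending thread. The second is to keep the data in a SharedArrayBuffer, which both threads can read and write after it has been sent. Neither option applies to process.send() between separate processes, which always serialises the message.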
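A minimal sketch of what happens to the sender's data when a buffer is transferred versus when it is shared. To keep it short and synchronous, it uses a MessageChannel inside a single thread together with receiveMessageOnPort(); with a real Worker the same postMessage() calls would be made on the worker object or on parentPort. The function names and the sample values are made up for illustration.

```javascript
const { MessageChannel, receiveMessageOnPort } = require('node:worker_threads');

// Transfer: the ArrayBuffer moves to the receiver and the sender's view is detached.
function demoTransfer() {
  const { port1, port2 } = new MessageChannel();
  const data = new Float32Array([1, 2, 3, 4]); // stand-in for `await t.data()`
  const shape = [2, 2];
  // Only transfer buffers the sender no longer needs; copy first otherwise.
  port1.postMessage({ data, shape }, [data.buffer]);
  const { message } = receiveMessageOnPort(port2);
  port1.close();
  return { senderLength: data.length, received: message };
}

// Share: both sides keep a view onto the same memory.
function demoShared() {
  const { port1, port2 } = new MessageChannel();
  const sab = new SharedArrayBuffer(4 * Float32Array.BYTES_PER_ELEMENT);
  const senderView = new Float32Array(sab);
  senderView.set([1, 2, 3, 4]);
  port1.postMessage({ sab, shape: [2, 2] });
  const { message } = receiveMessageOnPort(port2);
  const receiverView = new Float32Array(message.sab);
  receiverView[0] = 42; // a write on the receiving side...
  port1.close();
  // ...is visible through the sender's view, which is still usable.
  return { senderFirst: senderView[0], senderLength: senderView.length };
}

console.log(demoTransfer().senderLength); // 0
console.log(demoShared());                // { senderFirst: 42, senderLength: 4 }
```

Because shared memory can be written from both sides at once, a real pipeline would still need some signalling (a message, or Atomics) so the consumer knows when a buffer is complete.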