0
votes

Is there a recommended/efficient way to convert a tf.data.Dataset to a Tensor when the underlying 'data examples' in the Dataset are flat arrays?

I am using tf.data.csv() to read and parse a CSV file, but then want to use the TensorFlow.js Core API to process the data as tf.Tensors.

2
The elements of the stream produced by tf.data.csv() are dicts with primitive values, i.e. numbers and strings. Those will be automatically converted to Tensors when you pass them to core API functions, so you don't need to worry about that. One special case is if you call batch() on your stream, in which case the resulting batches are already Tensors. - David Soergel
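As a stand-alone illustration of that comment (the column names and values below are made up for illustration, not taken from any real dataset):

```javascript
// Hypothetical element as produced by tf.data.csv() before batching:
// a plain object keyed by column name, holding primitive values.
const element = { crim: 0.00632, zn: 18.0, medv: 24.0 };

// Object.values() yields a flat numeric array, which core API
// functions such as tf.tensor1d() accept and convert to a Tensor.
const values = Object.values(element);
console.log(values); // [0.00632, 18, 24]
```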

2 Answers

0
votes

tf.data.Dataset.iterator() returns a Promise that resolves to an iterator over the dataset's elements.

const it = await flattenedDataset.iterator();
const t = [];
// Read only the first 5 batches; the whole dataset should not be
// read at once, since that could consume a lot of memory.
for (let i = 0; i < 5; i++) {
  const e = await it.next();
  t.push(e.value);
}
const result = tf.concat(t, 0);

Using for await...of:

const asyncIterable = {
  [Symbol.asyncIterator]() {
    return {
      i: 0,
      async next() {
        if (this.i < 5) {
          this.i++;
          const e = await it.next();
          return { value: e.value, done: false };
        }
        return { value: undefined, done: true };
      }
    };
  }
};

const t = [];
for await (const e of asyncIterable) {
  if (e) {
    t.push(e);
  }
}
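The same "stop after the first 5 elements" pattern can be sketched without TensorFlow.js; the take() helper and counter() stub below are illustrative only (counter() stands in for the dataset iterator and is not part of the TF.js API):

```javascript
// take(it, n): wrap any async iterator `it` so that for await...of
// stops after the first n elements -- the same idea as the
// asyncIterable above.
function take(it, n) {
  return {
    [Symbol.asyncIterator]() {
      let i = 0;
      return {
        async next() {
          if (i < n) {
            i++;
            return it.next();
          }
          return { value: undefined, done: true };
        }
      };
    }
  };
}

// Stub iterator yielding 0, 1, 2, ... forever.
function counter() {
  let i = 0;
  return { async next() { return { value: i++, done: false }; } };
}

async function firstFive() {
  const seen = [];
  for await (const e of take(counter(), 5)) {
    seen.push(e);
  }
  return seen;
}

firstFive().then(seen => console.log(seen)); // [0, 1, 2, 3, 4]
```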

const csvUrl =
'https://storage.googleapis.com/tfjs-examples/multivariate-linear-regression/data/boston-housing-train.csv';

(async function run() {
   // We want to predict the column "medv", which represents a median value of
   // a home (in $1000s), so we mark it as a label.
   const csvDataset = tf.data.csv(
     csvUrl, {
       columnConfigs: {
         medv: {
           isLabel: true
         }
       }
     });

   // Number of features is the number of column names minus one for the label
   // column.
   const numOfFeatures = (await csvDataset.columnNames()).length - 1;

   // Prepare the Dataset for training.
   const flattenedDataset =
     csvDataset
       .map(([rawFeatures, rawLabel]) =>
         // Convert rows from object form (keyed by column name) to array form.
         [...Object.values(rawFeatures), ...Object.values(rawLabel)])
       .batch(1);

   const it = await flattenedDataset.iterator();
   const asyncIterable = {
     [Symbol.asyncIterator]() {
       return {
         i: 0,
         async next() {
           if (this.i < 5) {
             this.i++;
             const e = await it.next();
             return { value: e.value, done: false };
           }
           return { value: undefined, done: true };
         }
       };
     }
   };

   const t = [];
   for await (const e of asyncIterable) {
     if (e) {
       t.push(e);
     }
   }
   console.log(tf.concat(t, 0).shape);
})();
<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
  </head>

  <body>
  </body>
</html>
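The map() step in the snippet above can be illustrated on its own; the column names and values below are made up for illustration:

```javascript
// Each element arriving at map() is a [features, label] pair of plain
// objects; spreading Object.values() of both produces one flat row.
const rawFeatures = { crim: 0.00632, zn: 18.0, indus: 2.31 }; // illustrative
const rawLabel = { medv: 24.0 };                              // illustrative

const flatRow = [...Object.values(rawFeatures), ...Object.values(rawLabel)];
console.log(flatRow); // [0.00632, 18, 2.31, 24]
```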
0
votes

Beware that this workflow is not typically recommended, because materializing all the data in main JavaScript memory may not work for large CSV datasets.

You can use the toArray() method of tf.data.Dataset objects. For example:

  const csvUrl =
    'https://storage.googleapis.com/tfjs-examples/multivariate-linear-regression/data/boston-housing-train.csv';

  (async function run() {
    const csvDataset = tf.data.csv(
      csvUrl, {
        columnConfigs: {
          medv: {
            isLabel: true
          }
        }
      }).batch(4);

    const tensors = await csvDataset.toArray();
    console.log(tensors.length);
    console.log(tensors[0][0]);
  })();