
I am trying to stream a file from a Google storage bucket through my Cloud Function to a new file in another bucket - my actual use case is to transform data in csv files but my example below removes that part for simplicity.

I have two buckets <bucket-name> and <bucket-name>-copy.

Code:


const util = require('util')
const stream = require('stream')
const pipeline = util.promisify(stream.pipeline);
const {Storage} = require('@google-cloud/storage')
const storage = new Storage()

exports.testStream = (event) => {

  const file = event;
  console.log(`Processing file: ${JSON.stringify(file)}`)
  const startDate = Date.now()

  async function run() {
    await pipeline(
      storage.bucket(file.bucket).file(file.name).createReadStream(),
      storage.bucket(file.bucket+'-copy').file(file.name).createWriteStream({gzip: true})
    )
    console.log('Pipeline complete. Time:', Date.now() - startDate, 'ms')
  }

  return run().catch(console.error)

}

I deploy the Cloud Function to the same region as the buckets:

gcloud functions deploy testStream --runtime nodejs10 --region europe-west2 --trigger-resource <bucket-name> --trigger-event google.storage.object.finalize --memory=256MB

To trigger the function I copy a small 100 line csv file to the src bucket:

gsutil cp 100Rows.txt gs://<bucket-name>

If I run the function locally it executes immediately as expected; in fact I can stream a 1M-row file in roughly linear time. Yet the deployed Cloud Function above takes around 45s to copy this tiny file, and larger files never seem to complete. I also notice that the pipeline success log appears after the "function execution took... 'ok'" log:


2020-04-22 20:20:40.496 BST
testStream 1142856940990219 Function execution started
2020-04-22 20:20:40.554 BST Processing file: {"bucket":"my-bucket","name":"100Rows.txt"} //removed rest of object for brevity
2020-04-22 20:20:40.650 BST Function execution took 155 ms, finished with status: 'ok'
2020-04-22 20:21:33.841 BST Pipeline succeeded. Time: 53286 ms

Any ideas on where I'm going wrong or is this a known limitation that I have overlooked? (I have looked a lot!)

Thanks

John

It looks like you're not dealing with JavaScript promises at all. Functions should return a promise that resolves when all the asynchronous work is complete. That's how Cloud Functions knows when it's safe to terminate the function and clean up. - Doug Stevenson
Thanks Doug! I did have a promise in there originally, but I took it out when it still didn't work. The reason it didn't work is that I didn't return run() at the end of the function: I had blindly copied the example in the Node docs for stream.pipeline, which worked fine on the command line because it happened to be the last statement executed. I lost a few hours on this one and feel a bit daft, but it now works and I am very grateful for your help! - John L
And I have now updated the example code in the question. - John L

1 Answer


The solution came in three parts:

  1. Implement the promise as suggested by Doug
  2. Return the promise
  3. Increase the deploy option to --memory=2048MB; memory and CPU are allocated together, so this also gives a decent-sized processor - something I hadn't realised - and stops the timeouts

I edited the code in my question but here it is again anyway:

const util = require('util')
const stream = require('stream')
const pipeline = util.promisify(stream.pipeline);
const {Storage} = require('@google-cloud/storage')
const storage = new Storage()

exports.testStream = (event) => {

  const file = event;
  console.log(`Processing file: ${JSON.stringify(file)}`)
  const startDate = Date.now()

  async function run() {
    // Stream the object from the source bucket into the -copy bucket,
    // gzip-compressing on the way.
    await pipeline(
      storage.bucket(file.bucket).file(file.name).createReadStream(),
      storage.bucket(file.bucket+'-copy').file(file.name).createWriteStream({gzip: true})
    )
    console.log('Pipeline complete. Time:', Date.now() - startDate, 'ms')
  }

  // Returning the promise lets Cloud Functions wait for the pipeline
  // to finish before terminating the instance.
  return run().catch(console.error)

}
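One caveat worth noting: .catch(console.error) converts a failed pipeline into a successfully resolved promise, so the platform records the invocation as 'ok' and will not retry it. If failures should surface, log and rethrow instead - a sketch, with run as a hypothetical stand-in for the wrapper above:

```javascript
// Hypothetical stand-in for the pipeline wrapper, simulating a failure.
const run = () => Promise.reject(new Error('boom'))

function handler() {
  return run().catch((err) => {
    console.error('Pipeline failed:', err.message)
    throw err // propagate so the failure is recorded (and retried if enabled)
  })
}
```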

Deploy to GCP:

gcloud functions deploy testStream --runtime nodejs10 --region europe-west2 --trigger-resource <bucket-name> --trigger-event google.storage.object.finalize --memory=2048MB