0 votes

I am trying to write a Cloud Function that loads GCS files matching a wildcard into BigQuery. All the samples I can see, and the API reference, use a load method that specifies a single filename via bigquery.dataset.table.load.

What method can I use to trigger a file load using wildcards? I know I can specify the source URI in the metadata, but I can't see a load method example to go along with it. Any help appreciated.

Here is my code.

exports.importReadyToIngestFiles = (event, context) => {
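  // the Pub/Sub message payload is base64-encoded and carries the source bucket name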
  const pubsubMessage = event.data;
  console.log(Buffer.from(pubsubMessage, 'base64').toString());
  const bucketName = Buffer.from(pubsubMessage, 'base64').toString();

  const {BigQuery} = require('@google-cloud/bigquery');
  const {Storage} = require('@google-cloud/storage');

  // specify the projectId and BigQuery datasetId below
  const projectId = "test-3";
  const datasetId = "Data";
  const filename = "p_*";

  const gcsFile = `${bucketName}/p_*`;

  const tableId = "numeric";

  const bigquery = new BigQuery({
    projectId: projectId,
  });

  const storage = new Storage({
    projectId: projectId,
  });

  let job;

  // Loads data from a Google Cloud Storage file into the table
  bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename))
    .then(results => {
      job = results[0];
      console.log(`Job ${job.id} started.`);

      // Wait for the job to finish
      return job;
    })
    .then(metadata => {
      // Check the job's status for errors
      const errors = metadata.status.errors;
      if (errors && errors.length > 0) {
        throw errors;
      }
    })
    .then(() => {
      console.log(`Job ${job.id} completed.`);
    })
    .catch(err => {
      console.error('ERROR:', err);
    });

  // callback();

};

UPDATE: Aah, I figured out by looking at the BigQuery job history error that my file is pipe-delimited, so there was a column mismatch. Now, where can I find how to specify the delimiter? Is it just me, or is the Node.js BigQuery API documentation a bit unclear?

Have you tried replacing the filename/string argument with a wildcard, e.g. "gs://my_bucket/*.csv"? - Graham Polley
Yes. That did not pick up the files. I did not get an error either. - AIK DO
It should work. Are you sure the problem isn't something else? Have a look at the log for the actual BigQuery load job that is triggered and see what it says. - Graham Polley
@AIKDO Did you solve it? - Black_Bacardi

1 Answer

1 vote

As you mentioned in the initial question, there is an option to supply the sourceUris[] field inside the load job's metadata to leverage the wildcard search, as confirmed in the JobConfigurationLoad documentation:

sourceUris[]

The fully-qualified URIs that point to your data in Google Cloud. For Google Cloud Storage URIs: Each URI can contain one '*' wildcard character and it must come after the 'bucket' name.
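For example, keeping the my_bucket placeholder from this answer and the p_ prefix from the question:

// Valid: the wildcard comes after the bucket name
'gs://my_bucket/p_*'

// Invalid: the wildcard falls inside the bucket name, which BigQuery rejects
'gs://my_bucket*'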

Specify the metadata when creating the load job. Note that the Node.js client has no top-level bigquery.load() method; the method that accepts a raw load configuration with sourceUris is bigquery.createJob(). Something like this:

const metadata = {
  configuration: {
    load: {
      sourceUris: ['gs://my_bucket/p_*'],
      destinationTable: {projectId: 'test-3', datasetId: 'Data', tableId: 'numeric'},
      // ...other load options, e.g. fieldDelimiter
    },
  },
};

bigquery.createJob(metadata);
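createJob() resolves with a Job object; one way to wait for the result is the job's complete and error events, something like this sketch:

bigquery
  .createJob(metadata)
  .then(([job]) => {
    job.on('complete', meta => console.log(`Job ${meta.id} completed.`));
    job.on('error', err => console.error('ERROR:', err));
  })
  .catch(err => console.error('ERROR:', err));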

You can also set the fieldDelimiter field in the load configuration to adjust the column separator for the input data file.
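Tying that back to the code in the question, a minimal sketch using the table.load() form with a metadata argument (same bucket, dataset and table variables as in the question; skipLeadingRows is an assumption for files with a header row):

const metadata = {
  fieldDelimiter: '|', // the input files are pipe-delimited
  skipLeadingRows: 1,  // assumption: drop this if the files have no header row
};

bigquery
  .dataset(datasetId)
  .table(tableId)
  .load(storage.bucket(bucketName).file('p_*'), metadata)
  .then(([job]) => console.log(`Job ${job.id} completed.`))
  .catch(err => console.error('ERROR:', err));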

You can find the source code for the createLoadJob method here.