5 votes

We have an automated FTP process set up which imports a data file into Google Cloud Storage daily.

I would like to set up a daily automated job that uploads this CSV into a BigQuery table.

What is the best way to do this? My current first thought is to set up an App Engine instance with a cron job that runs a Python script every day. Is there a better solution?

2
This is the pipeline/pattern you're looking for: File > GCS > Dataflow (template) > BigQuery - Graham Polley

2 Answers

4 votes

A background Cloud Function with a Cloud Storage trigger is your best choice!

You can set it to monitor a specific bucket for new files and execute a load script whenever the trigger fires.

Forgot to mention - Cloud Functions support (as of now) only Node.js for scripting - which is usually not a problem, but I just wanted to mention it :o)
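
To make that concrete, here is a minimal sketch of such a trigger-driven load function - not a definitive implementation: the dataset and table names are placeholders, and it assumes the @google-cloud/bigquery and @google-cloud/storage clients, with a client version whose Table#load() accepts a Cloud Storage file object:

const BigQuery = require('@google-cloud/bigquery');
const Storage = require('@google-cloud/storage');

const bigquery = BigQuery();
const storage = Storage();

// Background function wired to the bucket's object-finalize trigger,
// invoked for every new file uploaded to the watched bucket.
exports.loadCsv = (event, callback) => {
  const file = event.data; // the object that was just uploaded

  bigquery
    .dataset('your_dataset')            // placeholder dataset
    .table('your_table')                // placeholder table
    .load(storage.bucket(file.bucket).file(file.name), {
      sourceFormat: 'CSV',
      skipLeadingRows: 1,               // assumes the CSV has a header row
      writeDisposition: 'WRITE_APPEND'  // append each daily file
    })
    .then(() => callback())
    .catch(callback);
};

Deploy it with the bucket as the trigger resource and it will run on every new upload, so the daily FTP drop gets loaded without any cron job at all.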

2 votes

Similar approach to Mikhail's answer - but with more code:

With Google Cloud Functions you can run a BigQuery job automatically each time you receive a new file:

Codewise, declare the BigQuery dependency in package.json:

{
  "dependencies": {
    "@google-cloud/bigquery": "^0.9.6"
  } 
}

And in index.js you can act on the new file in any appropriate way:

var BigQuery = require('@google-cloud/bigquery');
var bigQuery = BigQuery({ projectId: 'your-project-id' });

// Background function invoked for every new object in the bucket
exports.processFile = (event, callback) => {
  console.log('Processing: ' + JSON.stringify(event.data));
  query(event.data);
  callback();
};

function query(data) {
    // name and full gs:// URI of the file that triggered the function
    const filename = data.name.split('/').pop();
    const full_filename = `gs://${data.bucket}/${data.name}`;

    // if you want to run a query:
    const sqlQuery = '...';
    bigQuery.query({
        query: sqlQuery,
        useLegacySql: false
    });
}
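
And if the goal is to load the file into a table rather than query it, the same data object can feed a load job instead - a hedged variant with placeholder dataset/table names, additionally assuming the @google-cloud/storage client:

var Storage = require('@google-cloud/storage');
var storage = Storage();

function load(data) {
    // hand the newly uploaded CSV to a load job into a (placeholder) table
    bigQuery
        .dataset('your_dataset')
        .table('your_table')
        .load(storage.bucket(data.bucket).file(data.name), {
            sourceFormat: 'CSV'
        });
}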