1 vote

I am switching to BigQuery because of its high performance, but I have no idea how to upload data from Google Cloud Storage into a BigQuery database. Some more questions: Can I directly access my database in Google Cloud Storage while using BigQuery? Will I have to convert it to some format first? How will I keep the BigQuery database up to date with my Google Cloud Storage data?

Thanks in advance.


2 Answers

8 votes

Assuming your data is in a supported format (delimited, such as CSV/TSV, or JSON), you can easily load it from Google Cloud Storage into BigQuery using the UI, the CLI, or the API. For example, using the CLI:

bq load mydataset.mytable gs://my_bucket/file.csv name:string,gender:string,count:integer 

This loads file.csv from your Google Cloud Storage bucket 'my_bucket' into the table 'mytable' in the 'mydataset' dataset. The table will have three columns: name and gender of type string, and count of type integer. It might be useful for you to review the BigQuery quick start guide [1].

If you need to add more data, just run the bq load command again; by default it will append the new rows from the CSV to the BigQuery table. If you need to overwrite the data instead, add the --replace flag so the existing contents are erased before the new data is loaded.
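For illustration, reusing the same hypothetical bucket, table, and schema from the command above, an overwrite load would look like this:

bq load --replace mydataset.mytable gs://my_bucket/file.csv name:string,gender:string,count:integer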

Moreover, you can even run queries against your files in Google Cloud Storage without loading them into BigQuery first by using external (federated) tables [2].
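As a rough sketch (the exact flag syntax can vary between bq versions, and the table name, schema, and URI are just the hypothetical ones from the load example above), a one-off federated query from the CLI can look something like:

bq query --external_table_definition=mytable::name:string,gender:string,count:integer@CSV=gs://my_bucket/file.csv 'SELECT name, count FROM mytable'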

[1] https://cloud.google.com/bigquery/bq-command-line-tool-quickstart

[2] https://cloud.google.com/bigquery/federated-data-sources

0 votes

Using Python, you can load the data like this:

import time
import uuid

from gcloud import bigquery


def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)

    # Recreate the table from scratch so every run starts with an empty table.
    if table.exists():
        table.delete()
    table.schema = (
        bigquery.SchemaField('ID', 'STRING'),
        bigquery.SchemaField('days', 'STRING'),
        bigquery.SchemaField('last_activ_date', 'STRING'),
    )
    table.create()

    # Start a load job that reads the file(s) at `source` from Cloud Storage.
    job_name = str(uuid.uuid4())
    job = bigquery_client.load_table_from_storage(
        job_name, table, source)
    job.begin()

    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))


def wait_for_job(job):
    # Poll the job until BigQuery reports it as finished.
    while True:
        job.reload()
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)


if __name__ == "__main__":
    load_data_from_gcs('my_model', 'my_output',
                       'gs://path-uat/data_project/my_output.csv')