2 votes

We are developing a project to process our log data. The idea is:

  • upload log data from our local Logstash to Google Cloud Storage
  • write a Python script that inserts a job to import the log data from Google Cloud Storage into Google BigQuery
  • write a Python script to process the data in BigQuery itself

Note: for the Python scripts, we are still deciding whether to run them on Google App Engine or Google Compute Engine.

The questions are:

  • Is this a practical solution?
  • The structure of the log data changes quite often, which will cause errors when inserting into BigQuery. How should we handle that in the Python script?
  • In case we have to re-run the log data for a particular period, how can we do that? Do we need to write a Python script?

Thanks


2 Answers

2 votes

There is the streaming data solution that someone has already mentioned, but if you're trying to move a large block of log data rather than set up a continuous stream, you may want to take the route of using asynchronous load jobs instead.

The GCS library acts like most Python file libraries when used in Google App Engine, and can store files for import in Cloud Storage buckets:

import cloudstorage as gcs

# Write the log data to a Cloud Storage bucket; the "with" block closes the file.
filePath = "/CloudStorageBucket/dir/dir/logs.json"
with gcs.open(filePath, "w") as f:
    f.write(SomeLogData)

You can instruct BigQuery to load a list of CSV or newline-delimited JSON files in Cloud Storage by creating load jobs via the API (note: you will need to use OAuth 2):

from apiclient.discovery import build

service = build("bigquery", "v2", http = oAuthedHttp)

job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.json"],
            "schema": {
                "fields": [
                    {"name": "Column1",
                     "type": "STRING"},
                    ...
                ]
            },
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
        projectId = "Example-BigQuery-ProjectId",
        body = job
        ).execute()

You can read more about how to create BigQuery load jobs if you want to set other properties, like the write disposition or skipping rows in a CSV file. There are also other good examples of how to load data, including from the command line.
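
For example, here is roughly what the load configuration might look like for a CSV export that overwrites the destination table; the writeDisposition and skipLeadingRows values shown are just one possible choice, and the paths and column names are placeholders:

# Hypothetical CSV load: skip the header row and replace the table contents.
# Paths, project/dataset/table names and the schema are placeholders.
csv_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.csv"],
            "schema": {
                "fields": [
                    {"name": "Column1", "type": "STRING"}
                ]
            },
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,
            "writeDisposition": "WRITE_TRUNCATE",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
        projectId = "Example-BigQuery-ProjectId",
        body = csv_job
        ).execute()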

Edit:

To answer your more specific questions:

Is this a practical solution?

Yes. We export our Google App Engine logs to Cloud Storage and import them into BigQuery using deferred tasks. Some have used MapReduce jobs, but this can be overkill if you don't need to shuffle or reduce.
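
As a rough sketch of the deferred-task approach (import_logs here is a hypothetical function that would build and submit the jobs().insert() request shown above):

from google.appengine.ext import deferred

# Hypothetical task: build the authorized service and the load-job body as
# shown above, then submit it with service.jobs().insert(...).execute().
def import_logs(gcs_uri):
    pass

# Queue the import so the request that wrote the log file returns immediately;
# App Engine runs the task in the background on a task queue.
deferred.defer(import_logs, "gs://CloudStorageBucket/dir/dir/logs.json")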

The structure of the log data changes quite often, which will cause errors when inserting into BigQuery. How should we handle that in the Python script?

It shouldn't be an issue unless you're parsing the messages before they reach BigQuery. A better design would be to port the messages, timestamps, levels, etc. to BigQuery and then digest them with queries there.
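
One way to keep the import itself stable, sketched below under the assumption that you only need a few fixed columns, is to keep the variable part of each log entry in a single STRING column and parse it later with queries:

# Deliberately loose schema: the raw message stays a STRING, so changes in
# the log payload don't break the load job. Column names are just examples.
stable_schema = {
    "fields": [
        {"name": "timestamp", "type": "TIMESTAMP"},
        {"name": "level",     "type": "STRING"},
        {"name": "message",   "type": "STRING"}
    ]
}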

In case we have to re-run the log data for a particular period, how can we do that? Do we need to write a Python script?

Streaming the data won't give you backups unless you set them up yourself in BigQuery. Using the method I outlined above will automatically give you backups in Google Cloud Storage, which is preferred.

Know that BigQuery is an OLAP database, not a transactional one, so it's typically best to rebuild tables each time you add more log data rather than try to insert new data. It's counter-intuitive, but BigQuery is designed for this, as it can import 10,000 files / 1 TB at a time. Using pagination with the job write disposition, you can in theory import hundreds of thousands of records fairly quickly. Streaming the data would be ideal if you don't care about having backup logs.
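
For re-running a particular period, a small script along these lines would do it, assuming the log files in Cloud Storage are grouped under a per-day prefix and that each day is loaded into its own table (both are assumptions about your layout, not requirements):

# Rebuild one day of logs by pointing a load job at that day's objects and
# overwriting the destination table. The bucket layout and table naming are
# assumptions; include a "schema" block as in the earlier example if the
# table does not already exist.
day = "2014-06-01"
rerun_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/logs/%s/*" % day],
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable_%s" % day.replace("-", "_")
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "writeDisposition": "WRITE_TRUNCATE",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

service.jobs().insert(
        projectId = "Example-BigQuery-ProjectId",
        body = rerun_job
        ).execute()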

3 votes

There is a new API for streaming data directly into BigQuery which may be a better match for your use case.

Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job. There are several important trade-offs to consider before choosing an approach.
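
A minimal streaming insert with the same client library might look like this (assuming an authorized BigQuery service object built as in the other answer; the row contents and insertId are placeholders):

# Stream a single log record with tabledata().insertAll(). insertId lets
# BigQuery de-duplicate retries; the row fields are placeholders.
body = {
    "rows": [
        {
            "insertId": "some-unique-id",
            "json": {
                "timestamp": "2014-06-01T12:00:00Z",
                "level": "INFO",
                "message": "example log line"
            }
        }
    ]
}

response = service.tabledata().insertAll(
        projectId = "Example-BigQuery-ProjectId",
        datasetId = "LogsDataset",
        tableId = "LogsTable",
        body = body
        ).execute()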

If the structure of your data changes, you could have BigQuery run over its tables and update them accordingly. Streaming the raw data will give you the most flexibility, but at the higher cost of having to post-process the data again.
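
For example, the post-processing itself can also be driven from Python with a query job (again assuming an authorized service object as in the other answer; the SQL and table name are placeholders):

# Digest the raw log rows with a query job; the SQL is just an example.
query_body = {
    "query": "SELECT level, COUNT(*) AS entries "
             "FROM [LogsDataset.LogsTable] GROUP BY level"
}

result = service.jobs().query(
        projectId = "Example-BigQuery-ProjectId",
        body = query_body
        ).execute()

# Each result row comes back as {"f": [{"v": value}, ...]}.
for row in result.get("rows", []):
    print row["f"][0]["v"], row["f"][1]["v"]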