2 votes

We are developing a project to process our log data. The idea is:

  • upload log data from our local Logstash to Google Cloud Storage
  • write a Python script that inserts a job to import the log data from Google Cloud Storage into Google BigQuery
  • write a Python script to process the data in BigQuery itself

Note: for the Python scripts, we are still deciding whether to run them on Google App Engine or Google Compute Engine.

The questions are:

  • Is this a practical solution?
  • The structure of the log data changes quite often, which will cause errors when inserting into BigQuery. How should we handle that in the Python script?
  • In case we have to re-run the log data for a particular period, how can we do that? Do we need to write a Python script?

Thanks


2 Answers

2 votes

There is the streaming data solution that someone has already mentioned, but if you're trying to move a large block of log data rather than set up a continuous stream, you may want to take the route of using asynchronous load jobs instead.

The GCS library acts like most Python file libraries when used in Google App Engine, and can store files for import in Cloud Storage buckets:

import cloudstorage as gcs

# Write the log data to a Cloud Storage bucket; the "with" block closes the file.
filePath = "/CloudStorageBucket/dir/dir/logs.json"
with gcs.open(filePath, "w") as f:
    f.write(SomeLogData)

You can instruct BigQuery to load a list of CSV or newline-delimited JSON files in Cloud Storage by creating load jobs via the API (note: you will need to use OAuth 2):

from apiclient.discovery import build

service = build("bigquery", "v2", http = oAuthedHttp)

job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.json"],
            "schema": {
                "fields": [
                    {"name": "Column1",
                     "type": "STRING"},
                    ...
                ]
            },
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
        projectId = "Example-BigQuery-ProjectId",
        body = job
        ).execute()

You can read more about how to create BigQuery load jobs if you want to set other properties, like the write disposition or skipping rows in a CSV file. There are also other good examples of how to load data, including from the command line.
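
For example, here is roughly what the load configuration might look like for a CSV export that overwrites the destination table; the writeDisposition and skipLeadingRows values shown are just one possible choice, and the paths and column names are placeholders:

# Hypothetical CSV load: skip the header row and replace the table contents.
# Paths, project/dataset/table names and the schema are placeholders.
csv_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.csv"],
            "schema": {
                "fields": [
                    {"name": "Column1", "type": "STRING"}
                ]
            },
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,
            "writeDisposition": "WRITE_TRUNCATE",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
        projectId = "Example-BigQuery-ProjectId",
        body = csv_job
        ).execute()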

Edit:

To answer your more specific questions:

Is this a practical solution?

Yes. We export our Google App Engine logs to Cloud Storage and import them into BigQuery using deferred tasks. Some have used MapReduce jobs, but this can be overkill if you don't need to shuffle or reduce.
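
As a rough sketch of the deferred-task approach (import_logs here is a hypothetical function that would build and submit the jobs().insert() request shown above):

from google.appengine.ext import deferred

# Hypothetical task: build the authorized service and the load-job body as
# shown above, then submit it with service.jobs().insert(...).execute().
def import_logs(gcs_uri):
    pass

# Queue the import so the request that wrote the log file returns immediately;
# App Engine runs the task in the background on a task queue.
deferred.defer(import_logs, "gs://CloudStorageBucket/dir/dir/logs.json")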

The structure of the log data changes quite often, which will cause errors when inserting into BigQuery. How should we handle that in the Python script?

It shouldn't be an issue unless you're parsing the messages before they reach BigQuery. A better design would be to port the messages, timestamps, levels, etc. to BigQuery and then digest them with queries there.
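
One way to keep the import itself stable, sketched below under the assumption that you only need a few fixed columns, is to keep the variable part of each log entry in a single STRING column and parse it later with queries:

# Deliberately loose schema: the raw message stays a STRING, so changes in
# the log payload don't break the load job. Column names are just examples.
stable_schema = {
    "fields": [
        {"name": "timestamp", "type": "TIMESTAMP"},
        {"name": "level",     "type": "STRING"},
        {"name": "message",   "type": "STRING"}
    ]
}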

In case we have to re-run the log data for a particular period, how can we do that? Do we need to write a Python script?

Streaming the data won't give you backups unless you set them up yourself in BigQuery. Using the method I outlined above will automatically give you backups in Google Cloud Storage, which is preferred.

Know that BigQuery is an OLAP database, not a transactional one, so it's typically best to rebuild tables each time you add more log data rather than try to insert new data. It's counter-intuitive, but BigQuery is designed for this, as it can import 10,000 files / 1 TB at a time. Using pagination with the job write disposition, you can in theory import hundreds of thousands of records fairly quickly. Streaming the data would be ideal if you don't care about having backup logs.
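
For re-running a particular period, a small script along these lines would do it, assuming the log files in Cloud Storage are grouped under a per-day prefix and that each day is loaded into its own table (both are assumptions about your layout, not requirements):

# Rebuild one day of logs by pointing a load job at that day's objects and
# overwriting the destination table. The bucket layout and table naming are
# assumptions; include a "schema" block as in the earlier example if the
# table does not already exist.
day = "2014-06-01"
rerun_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/logs/%s/*" % day],
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable_%s" % day.replace("-", "_")
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "writeDisposition": "WRITE_TRUNCATE",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

service.jobs().insert(
        projectId = "Example-BigQuery-ProjectId",
        body = rerun_job
        ).execute()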

3 votes

There is a new API for streaming data directly into BigQuery which may be a better match for your use case.

Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job. There are several important trade-offs to consider before choosing an approach.
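
A minimal streaming insert with the same client library might look like this (assuming an authorized BigQuery service object built as in the other answer; the row contents and insertId are placeholders):

# Stream a single log record with tabledata().insertAll(). insertId lets
# BigQuery de-duplicate retries; the row fields are placeholders.
body = {
    "rows": [
        {
            "insertId": "some-unique-id",
            "json": {
                "timestamp": "2014-06-01T12:00:00Z",
                "level": "INFO",
                "message": "example log line"
            }
        }
    ]
}

response = service.tabledata().insertAll(
        projectId = "Example-BigQuery-ProjectId",
        datasetId = "LogsDataset",
        tableId = "LogsTable",
        body = body
        ).execute()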

If the structure of your data changes, you could have BigQuery run over its tables and update them accordingly. Streaming the raw data will give you the most flexibility, but at the higher cost of having to post-process the data again.
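
For example, the post-processing itself can also be driven from Python with a query job (again assuming an authorized service object as in the other answer; the SQL and table name are placeholders):

# Digest the raw log rows with a query job; the SQL is just an example.
query_body = {
    "query": "SELECT level, COUNT(*) AS entries "
             "FROM [LogsDataset.LogsTable] GROUP BY level"
}

result = service.jobs().query(
        projectId = "Example-BigQuery-ProjectId",
        body = query_body
        ).execute()

# Each result row comes back as {"f": [{"v": value}, ...]}.
for row in result.get("rows", []):
    print row["f"][0]["v"], row["f"][1]["v"]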