There is the streaming-insert solution that someone has already mentioned, but if you're trying to move a large block of log data rather than set up a continuous stream, you may want to take the route of using asynchronous load jobs instead.
The GCS client library behaves like most Python file libraries when used in Google App Engine, and can write the files you want to import into Cloud Storage buckets:
import cloudstorage as gcs

# Path format is /bucket_name/path/to/object
filePath = "/CloudStorageBucket/dir/dir/logs.json"

# The with block closes the file automatically, so no explicit close() is needed
with gcs.open(filePath, "w") as f:
    f.write(SomeLogData)
You can instruct BigQuery to load a list of CSV or newline-delimited JSON files from Cloud Storage by creating load jobs via the API (note: you will need an OAuth 2 authorized http object):
from apiclient.discovery import build

# The http object must already be authorized with OAuth 2
service = build("bigquery", "v2", http=oAuthedHttp)

job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.json"],
            "schema": {
                "fields": [
                    {"name": "Column1",
                     "type": "STRING"},
                    ...
                ]
            },
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
    projectId="Example-BigQuery-ProjectId",
    body=job
).execute()
You can read more about how to create BigQuery load jobs if you want to set other properties like the write disposition or skipping rows in a CSV file. You can also find other good examples of how to load data, including command-line examples.
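For instance, a CSV load that skips the header row and appends to an existing table might look like the sketch below (the bucket, dataset and table names are the same placeholders as above; skipLeadingRows and writeDisposition are standard load-configuration properties):

csv_job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.csv"],
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,                # ignore the CSV header row
            "writeDisposition": "WRITE_APPEND",  # append if the table already exists
            # include a "schema" block as above if the table may need to be created
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
    projectId="Example-BigQuery-ProjectId",
    body=csv_job
).execute()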
Edit:
To answer your more specific questions:
Is this a practical solution?
Yes. We export our Google App Engine logs to Cloud Storage and import them into BigQuery using deferred tasks. Some have used MapReduce jobs, but that can be overkill if you don't need to shuffle or reduce.
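As a minimal sketch of the deferred approach (write_logs_to_gcs and start_load_job are hypothetical helpers wrapping the cloudstorage write and jobs().insert() calls shown above, not our exact code):

from google.appengine.ext import deferred

def export_and_load(log_batch):
    # Write the batch to Cloud Storage, then kick off the BigQuery load job
    write_logs_to_gcs("/CloudStorageBucket/dir/dir/logs.json", log_batch)
    start_load_job("gs://CloudStorageBucket/dir/dir/logs.json")

# Queue the export outside the request that collected the logs
deferred.defer(export_and_load, some_log_batch)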
The structure of the log data changes quite often; this will cause an error when inserting into BigQuery. How are we going to handle it in the Python script?
It shouldn't be an issue unless you're parsing the messages before they reach BigQuery. A better design is to send the raw messages, timestamps, levels, etc. to BigQuery and then digest them with queries there.
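In practice that can mean loading with a small, stable schema and leaving the message unparsed, something like this sketch (the column names are just illustrative):

stable_schema = {
    "fields": [
        {"name": "timestamp", "type": "TIMESTAMP"},
        {"name": "level",     "type": "STRING"},
        {"name": "message",   "type": "STRING"}   # raw, unparsed log line
    ]
}

A change in the log message format then doesn't break the load job; you parse the message column later with queries in BigQuery.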
In case we have to rerun log data for a particular period, how can we do that? Do we need to write a Python script?
Streaming the data won't give you backups unless you set them up yourself in BigQuery. The method I outlined above automatically gives you backups in Google Cloud Storage, which is preferable.
Know that BigQuery is an OLAP database, not a transactional one, so it's typically best to rebuild tables each time you add more log data rather than try to insert new rows. It's counter-intuitive, but BigQuery is designed for this, as it can import 10,000 files / 1TB at a time. By batching files into load jobs and setting the job's write disposition, you can in theory import hundreds of thousands of records fairly quickly. Streaming the data would be ideal if you don't care about having backup logs.
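As a sketch of that rebuild pattern (wildcard URIs and WRITE_TRUNCATE are standard load-job features; the paths are placeholders):

rebuild_job = {
    "configuration": {
        "load": {
            # The wildcard lets one job pick up every exported file
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs-*.json"],
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "writeDisposition": "WRITE_TRUNCATE"   # overwrite the table with the full set of logs
        }
    }
}

service.jobs().insert(
    projectId="Example-BigQuery-ProjectId",
    body=rebuild_job
).execute()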