0
votes

I need to load hundreds of tables into BigQuery from Google Cloud Storage (GCS -> temp table -> main table). I have written a Python process that loads the data into BigQuery and scheduled it in App Engine. Because App Engine has a maximum timeout of 10 minutes, I submit the jobs in asynchronous mode and check their status at a later point. With hundreds of tables, I need a monitoring system to check the status of the load jobs.
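For reference, the submit-then-check flow described above might look roughly like the sketch below. It assumes `client` is a `google.cloud.bigquery.Client` and `job_config` is a `bigquery.LoadJobConfig`; the function names, the `sources` mapping, and the status strings are made up for illustration and are not from my actual process.

```python
# Assumed setup with the google-cloud-bigquery package:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   job_config = bigquery.LoadJobConfig(
#       source_format=bigquery.SourceFormat.CSV, autodetect=True)


def submit_loads(client, job_config, sources):
    """Start one load job per table and return the job IDs without waiting."""
    job_ids = {}
    for table_id, uri in sources.items():
        # load_table_from_uri returns once the job is created, not when it
        # finishes, so this stays well inside the request timeout.
        job = client.load_table_from_uri(uri, table_id, job_config=job_config)
        job_ids[table_id] = job.job_id
    return job_ids


def check_loads(client, job_ids):
    """Look up each job later and classify it as running, done, or failed."""
    status = {}
    for table_id, job_id in job_ids.items():
        # Depending on the dataset location, get_job may also need location=...
        job = client.get_job(job_id)
        if job.state != "DONE":
            status[table_id] = "running"
        elif job.error_result:
            status[table_id] = "failed: " + job.error_result.get("message", "")
        else:
            status[table_id] = "done"
    return status
```

The job IDs returned by `submit_loads` are what has to be stored somewhere between the two scheduled runs, which is where the monitoring tables come in.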

I need to maintain a couple of tables and a bunch of views to check the job status.

The operational process is a little complex. Is there a better way?

Thanks

1
Could you please summarize the steps you perform? Maybe something can be improved. - Pentium10
Could you please tell me, step by step, how you created the Python process in App Engine and scheduled it? I am looking to do the same: load CSV files from a Google Cloud Storage bucket into a BigQuery table using Python on App Engine. - LondonUK

1 Answer

0
votes

When we did this, we simply used a message queue such as Beanstalkd: we pushed anything that had to be checked later onto the queue, and a small worker subscribed to the channel and handled each task.
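A minimal sketch of that worker pattern is below. To keep it self-contained it uses Python's standard `queue.Queue` as a stand-in for Beanstalkd, and `is_finished` is a hypothetical callback that would wrap the BigQuery job status lookup; neither is what we actually ran.

```python
import queue


def run_worker(pending, is_finished, max_checks=1000):
    """Drain the queue: collect finished jobs, push unfinished ones back."""
    finished = []
    checks = 0
    while checks < max_checks:
        try:
            job_id = pending.get_nowait()
        except queue.Empty:
            break  # nothing left to check
        checks += 1
        if is_finished(job_id):
            finished.append(job_id)
        else:
            # Not done yet: put it back so it is checked on a later pass.
            pending.put(job_id)
    return finished
```

With a real Beanstalkd queue, the re-queued job can be given a delay so the worker does not poll in a tight loop; this sketch does not wait between checks.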

Alternatively, BigQuery supports querying data directly from Google Cloud Storage.

Use cases:
- Loading and cleaning your data in one pass by querying the data from a federated data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As a federated data source, the frequently changing data does not need to be reloaded every time it is updated.

https://cloud.google.com/bigquery/federated-data-sources
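As a rough illustration of the first use case, the sketch below builds the request body for a BigQuery `jobs.insert` call that queries a CSV file in GCS through a temporary external table definition and writes the filtered rows to a native table. The function name, the `staging` alias, the `id` column, and the filter are assumptions made up for the example.

```python
def federated_query_job(project, dataset, table, gcs_uri):
    """Build a jobs.insert body: query a CSV in GCS, store the cleaned rows."""
    return {
        "configuration": {
            "query": {
                # "staging" refers to the external definition below;
                # the id column is a hypothetical example of a cleaning rule.
                "query": "SELECT * FROM staging WHERE id IS NOT NULL",
                "useLegacySql": False,
                "tableDefinitions": {
                    "staging": {
                        "sourceFormat": "CSV",
                        "sourceUris": [gcs_uri],
                        "autodetect": True,  # infer the schema from the file
                        "csvOptions": {"skipLeadingRows": 1},  # header row
                    }
                },
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                # Replace the destination table contents on each run.
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }
```

The google-cloud-bigquery Python client exposes the same settings through `ExternalConfig` and `QueryJobConfig.table_definitions`, so the dictionary form is only one way to express it.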