0
votes

I have about 100 GB of data in BigQuery, and I'm fairly new to using data analysis tools. I want to pull about 3000 extracts via a programmatic series of SQL queries, and then run some statistical analysis to compare kurtosis across those extracts.

Right now my workflow is as follows:

  • running on my local machine, use the BigQuery Python client API to grab the data extracts and save them locally
  • running on my local machine, run the kurtosis analysis over the extracts using scipy

The second of these works fine, but saving all 3000 data extracts locally is pretty slow and painful (network timeouts, etc.).
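For concreteness, here's a stripped-down sketch of that workflow. The dataset, table, column, and `segment` filter are hypothetical placeholders, and the hand-rolled kurtosis helper computes the same excess (Fisher) kurtosis that `scipy.stats.kurtosis` returns with its defaults:

```python
def fisher_kurtosis(values):
    """Excess (Fisher) kurtosis: the quantity scipy.stats.kurtosis
    returns with its defaults (fisher=True, bias=True)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / (m2 * m2) - 3.0


def fetch_extract(query):
    """Run one query and return the first column as a list of floats.
    Needs the google-cloud-bigquery package and authenticated credentials;
    the import is kept local so the rest of the module works without it."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    return [float(row[0]) for row in client.query(query).result()]


# Hypothetical query template -- one extract per segment id.
QUERY_TEMPLATE = "SELECT value FROM my_dataset.my_table WHERE segment = {seg}"


def kurtosis_by_segment(segments):
    """Map each segment id to the kurtosis of its extract."""
    return {
        seg: fisher_kurtosis(fetch_extract(QUERY_TEMPLATE.format(seg=seg)))
        for seg in segments
    }
```

The slow part is the ~3000 round trips inside `fetch_extract`; the kurtosis step itself is cheap.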

Is there a better way of doing this? Basically I'm wondering if there's some kind of cloud tool where I could quickly run the calls to get the 3000 extracts, then run the Python to do the kurtosis analysis.

I had a look at https://cloud.google.com/bigquery/third-party-tools but I'm not sure if any of those do what I need.

3
I should say, I'm not wedded to BigQuery - could switch to another service e.g. Spark if that would be more suitable. - Richard
If you're going to vote to close this, please could you explain why, and suggest where I should ask the question instead? I'm asking it in good faith, and I would really appreciate some help, and I think it's a useful question for others to learn from. - Richard

3 Answers

2
votes

So far, Cloud Datalab is your best option:
https://cloud.google.com/datalab/
It is in beta, so some surprises are possible.
Datalab is built on top of the Jupyter/IPython option below, and runs entirely in the cloud.

Another option is Jupyter/IPython Notebook
http://jupyter-notebook-beginner-guide.readthedocs.org/en/latest/

Our data science team started with the second option long ago with great success, and is now moving toward Datalab.

For the rest of the business (prod, BI, ops, sales, marketing, etc.), though, we had to build our own workflow/orchestration tool, as nothing we found was good or relevant enough.

1
vote

Two easy ways:

1: If your issue is the network, as you say, use a Google Compute Engine machine to do the analysis, in the same zone as your BigQuery tables (US, EU, etc.). It will not have network issues getting data from BigQuery and will be super fast. The machine will only cost you for the minutes you use it. Save a snapshot of your machine to reuse the machine setup any time (the snapshot also has a monthly cost, but a much lower one than keeping the machine up).
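For example, spinning up and later tearing down such a machine with the `gcloud` CLI might look roughly like this (a sketch, assuming `gcloud` is installed and authenticated; the instance name, zone, and machine type are placeholders):

```shell
# Create a VM in the same region as the BigQuery data (here: US).
gcloud compute instances create kurtosis-vm \
    --zone=us-central1-b \
    --machine-type=n1-standard-4

# SSH in, install Python deps, and run the analysis there.
gcloud compute ssh kurtosis-vm --zone=us-central1-b

# When done, snapshot the boot disk so the setup can be recreated later...
gcloud compute disks snapshot kurtosis-vm \
    --zone=us-central1-b \
    --snapshot-names=kurtosis-vm-snapshot

# ...then delete the instance to stop paying for it.
gcloud compute instances delete kurtosis-vm --zone=us-central1-b
```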

2: Use Google Cloud Datalab (in beta as of Dec 2015), which supports BigQuery sources and gives you all the tools you need to do the analysis and later share it with others: https://cloud.google.com/datalab/

from their docs: "Cloud Datalab is built on Jupyter (formerly IPython), which boasts a thriving ecosystem of modules and a robust knowledge base. Cloud Datalab enables analysis of your data on Google BigQuery, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions)."

0
votes

You can check out Cooladata.

It allows you to query BQ tables as external data sources. You can either schedule your queries and export the results to Google Cloud Storage, and pick up from there, or use the built-in reporting tool to answer your 3000 queries. It will also provide all the BI tools you will need for your business.