
I tried two approaches to import a large BigQuery table (about 50,000,000 rows, 18 GB) into a dataframe in Google Datalab, in order to do machine learning with TensorFlow.

First, I use (all needed modules are imported):

data = bq.Query('SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME`').execute().result().to_dataframe()

It just keeps running, seemingly forever. Even with LIMIT 1000000, nothing changes.

Second, I use:

data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect='standard', project_id='PROJECT_ID')

It runs well at first, but at around 450,000 rows (estimated from the percentage and the total row count) it gets stuck at:

Got page: 32; 45.0% done. Elapsed 293.1 s.

I also cannot find how to enable allowLargeResults in read_gbq(). Following its documentation, I try:

data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect='standard', project_id='PROJECT_ID', configuration={'query': {'allowLargeResults': True}})

Then I get:

read_gbq() got an unexpected keyword argument 'configuration'

So I failed to import even 1,000,000 rows into Google Cloud Datalab, and I actually want to import 50 times that amount.

Any ideas?

Thanks


1 Answer


Before loading large datasets into Google Cloud Datalab, make sure to consider alternatives such as those mentioned in the comments of this answer. Use sampled data for the initial analysis, determine the correct model for the problem, and then use a pipeline approach, such as Google Cloud Dataflow, to process the large dataset.
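
For example, here is a minimal sketch of pulling a small random sample for initial exploration; the use of RAND() and the 0.1% sampling rate are assumptions to adjust for your data (the column list and table name are the placeholders from the question):

import google.datalab.bigquery as bq

# Sample roughly 0.1% of rows so the initial analysis fits comfortably in memory.
sample = bq.Query(
    'SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` WHERE RAND() < 0.001'
).execute().result().to_dataframe()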

There is an interesting discussion regarding Datalab performance improvements when downloading data from BigQuery to Datalab here. Based on those performance tests, a performance improvement was merged into Google Cloud Datalab in Pull Request #339. This improvement does not appear to be mentioned in the release notes for Datalab, but I believe the fixes are included as part of Datalab 1.1.20170406. Please check the version of Google Cloud Datalab to make sure that you're running at least version 1.1.20170406. To check the version, first click on the user icon in the top right corner of the Cloud Datalab navigation bar, then click About Datalab.

Regarding the pandas.read_gbq() command that appears to be stuck, I would like to offer a few suggestions:

  1. Open a new issue in the pandas-gbq repository here.
  2. Try extracting the data from BigQuery to Google Cloud Storage in CSV format, for example, which you can then load into a dataframe using pd.read_csv. Here are two methods to do this:

Using Google BigQuery/Cloud Storage CLI tools: Using the bq command line tool and the gsutil command line tool, extract data from BigQuery to Google Cloud Storage, and then download the object into Google Cloud Datalab. To do this, run bq extract <source_table> <destination_uris>, followed by gsutil cp gs://[SOURCE_BUCKET_NAME]/[OBJECT_NAME] [LOCAL_DESTINATION] to copy the exported object down to the Datalab instance.

Using Google Cloud Datalab:

import google.datalab.bigquery as bq
import google.datalab.storage as storage
# Write the query results to an object in Cloud Storage, then download that object into Datalab.
bq.Query('<your query>').execute(output_options=bq.QueryOutput.file(path='gs://<your_bucket>/<object_name>', use_cache=False)).result()
result = storage.Bucket('<your_bucket>').object('<object_name>').download()
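
As a rough sketch (assuming the export was written as CSV and that download() returns the object's contents as bytes), you could then load the result into a dataframe:

import io
import pandas as pd

# result holds the downloaded object's contents from the snippet above (assumed to be CSV bytes).
data = pd.read_csv(io.BytesIO(result))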

Regarding the error read_gbq() got an unexpected keyword argument 'configuration': the ability to pass arbitrary keyword arguments (configuration) was added in pandas version 0.20.0. I believe this error is caused by the fact that pandas is not up to date. You can check the installed pandas version by running:

import pandas
pandas.__version__

To upgrade to version 0.20.0, run pip install --upgrade pandas pandas-gbq. This will also install pandas-gbq, which is an optional dependency for pandas.
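
Once pandas is up to date, the configuration keyword should be accepted. A minimal sketch, assuming pandas >= 0.20.0 and pandas-gbq are installed (the useQueryCache option here is only an illustration of passing a BigQuery job configuration, not a fix by itself):

import pandas as pd

# Pass an arbitrary BigQuery job configuration through read_gbq (requires pandas >= 0.20.0).
data = pd.read_gbq(
    query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000',
    project_id='PROJECT_ID',
    dialect='standard',
    configuration={'query': {'useQueryCache': False}},
)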

Alternatively, you could try iterating over the table in Google Cloud Datalab. This works, but it is likely slower. This approach was mentioned in another Stack Overflow answer here: https://stackoverflow.com/a/43382995/5990514
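
A minimal sketch of that idea, assuming the Datalab Table API exposes the row count as length and that to_dataframe() accepts start_row and max_rows (please verify against your Datalab version; the table name format is also an assumption):

import pandas as pd
import google.datalab.bigquery as bq

table = bq.Table('PROJECT_ID.DATASET.TABLE_NAME')  # hypothetical table reference
chunk_size = 100000  # rows per chunk; tune for your memory budget

# Pull the table down in fixed-size chunks and concatenate them into one dataframe.
chunks = []
for start_row in range(0, table.length, chunk_size):
    chunks.append(table.to_dataframe(start_row=start_row, max_rows=chunk_size))
data = pd.concat(chunks, ignore_index=True)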

I hope this helps! Please let me know if you have any issues so I can improve this answer.

Anthonios Partheniou

Contributor at Cloud Datalab

Project Maintainer at pandas-gbq