Before loading large datasets into Google Cloud Datalab, make sure to consider alternatives such as those mentioned in the comments of this answer: use sampled data for the initial analysis, determine the correct model for the problem, and then use a pipeline approach, such as Google Cloud Dataflow, to process the large dataset.
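For example, here is a minimal sketch of sampling inside the query itself (the table name, project ID, and 1% sampling rate are placeholders I've made up):

import pandas as pd

# Read roughly a 1% random sample of the table instead of the full dataset.
df = pd.read_gbq('SELECT * FROM [my_dataset.my_table] WHERE RAND() < 0.01',
                 project_id='my-project-id')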
There is an interesting discussion regarding Datalab performance improvements when downloading data from BigQuery to Datalab here. Based on those performance tests, a performance improvement was merged into Google Cloud Datalab in Pull Request #339. The improvement does not appear to be mentioned in the Datalab release notes, but I believe the fix is included in Datalab 1.1.20170406, so please make sure you're running at least that version. To check the version, click on the user icon in the top right corner of the navigation bar in Cloud Datalab, then click About Datalab.
Regarding the pandas.read_gbq() command that appears to be stuck, I would like to offer a few suggestions:
- Open a new issue in the pandas-gbq repository here.
- Try extracting data from BigQuery to Google Cloud Storage (in CSV format, for example), which you can then load into a dataframe using pd.read_csv. Here are two methods to do this:
Using Google BigQuery/Cloud Storage CLI tools:
Using the bq command line tool and the gsutil command line tool, extract the data from BigQuery to Google Cloud Storage, and then download the object into Google Cloud Datalab. To do this, run bq extract <source_table> <destination_uris>, followed by gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [LOCAL_DESTINATION] (note that gsutil cp takes the source first, so when downloading, the gs:// URI comes before the local path).
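As a concrete sketch (the dataset, table, bucket, and file names below are placeholders I've made up), these two commands could be run directly from a Datalab notebook cell using the ! shell escape:

!bq extract my_dataset.my_table gs://my_bucket/my_table.csv
!gsutil cp gs://my_bucket/my_table.csv /tmp/my_table.csv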
Using Google Cloud Datalab:
import google.datalab.bigquery as bq
import google.datalab.storage as storage

# Run the query and write the results to an object in Google Cloud Storage.
bq.Query(<your query>).execute(output_options=bq.QueryOutput.file(path='gs://<your_bucket>/<object name>', use_cache=False)).result()

# Download the object's contents into Datalab.
result = storage.Bucket(<your_bucket>).object(<object name>).download()
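The download() call returns the object's contents as bytes, so a minimal sketch for turning the result into a dataframe (assuming the query output was written in CSV format, as above) is:

import io
import pandas as pd

# Parse the downloaded CSV bytes into a pandas dataframe.
df = pd.read_csv(io.BytesIO(result))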
Regarding the error read_gbq() got an unexpected keyword argument 'configuration': the ability to pass arbitrary keyword arguments (configuration) was added in pandas version 0.20.0, so I believe this error is caused by the fact that pandas is not up to date. You can check the installed version of pandas by running:
import pandas
pandas.__version__
To upgrade to version 0.20.0 or later, run pip install --upgrade pandas pandas-gbq. This will also install pandas-gbq, which is an optional dependency for pandas.
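Once upgraded, the configuration keyword accepts a dictionary in the same format as a BigQuery job configuration. As a minimal sketch (the query, project ID, and useQueryCache setting here are illustrative):

import pandas as pd

# Disable the BigQuery query cache via the job configuration.
df = pd.read_gbq('SELECT COUNT(*) FROM [my_dataset.my_table]',
                 project_id='my-project-id',
                 configuration={'query': {'useQueryCache': False}})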
Alternatively, you could try iterating over the table in Google Cloud Datalab. This works, but it's likely slower. This approach is mentioned in another StackOverflow answer here: https://stackoverflow.com/a/43382995/5990514
I hope this helps! Please let me know if you have any issues so I can improve this answer.
Anthonios Partheniou
Contributor at Cloud Datalab
Project Maintainer at pandas-gbq