I have been running a Python Dataflow job that uses the pandas library. It suddenly started failing with the following error:

  File "/usr/local/lib/python2.7/dist-packages/pandas_gbq/auth.py", line 305, in _try_credentials
    client = bigquery.Client(project=project_id, credentials=credentials)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 161, in __init__
    self._connection = Connection(self, client_info=client_info)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/_http.py", line 33, in __init__
    super(Connection, self).__init__(client, client_info)
TypeError: __init__() takes exactly 2 arguments (3 given)

It is failing on this step:

import pandas as pd  
data = pd.read_gbq(query=query, project_id=project, dialect='standard', private_key=credentials)

My setup file looks like this:

install_requires=[
    'google-cloud-storage==1.11.0',
    'requests==2.19.1',
    'urllib3==1.23',
    'pandas-gbq==0.6.1',
    'pandas==0.23.4',
    'protobuf==3.6.0',
]

These are the same versions I have installed on my local machine, where the code works. No changes had been made to the job when it started failing. It runs successfully locally, but fails when I run it with the DataflowRunner. I suspect a dependency issue. Are there documented issues with any of the package versions I'm using? Or are there specific package versions I need to add to my setup file?

1 Answer

I had to add a BigQuery version to my setup file.

'google-cloud-bigquery==1.6.0'
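
For reference, here is a sketch of the dependency list from the question with that pin added. The name REQUIRED_PACKAGES is only my own label for illustration; the list is what would be passed as install_requires in the setup() call of the setup file.

```python
# Dependency list from the question, plus the explicit BigQuery pin.
# REQUIRED_PACKAGES is a hypothetical name; pass the list as
# install_requires=REQUIRED_PACKAGES in setuptools.setup(...).
REQUIRED_PACKAGES = [
    'google-cloud-storage==1.11.0',
    'google-cloud-bigquery==1.6.0',  # added: pin instead of relying on the worker's preinstalled version
    'requests==2.19.1',
    'urllib3==1.23',
    'pandas-gbq==0.6.1',
    'pandas==0.23.4',
    'protobuf==3.6.0',
]
```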

According to Google's documentation for Python SDK 2.5, Dataflow workers have BigQuery 0.25.0 preinstalled. Since I had not been specifying a version, I assume that is the version my job was running. If that version was the problem, I'm still not sure why the error only recently started happening. Regardless, pinning 1.6.0 resolved the issue.
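
To check my assumption about which versions an environment really has, a small lookup like the one below can be run locally and on a worker and the output compared. This is only a sketch: installed_versions and PACKAGES_TO_CHECK are names I made up, it assumes Python 3.8+ for importlib.metadata (on the Python 2.7 workers from the traceback, pkg_resources offers a similar lookup), and google-cloud-core is included only because google-cloud-bigquery depends on it.

```python
# Report which versions of selected packages are installed in this environment.
# Assumes Python 3.8+; importlib.metadata is not available on Python 2.7.
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Return {package name: installed version string, or None if not installed}."""
    found = {}
    for name in package_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages on the pandas.read_gbq call path (an illustrative selection).
PACKAGES_TO_CHECK = [
    "google-cloud-bigquery",
    "google-cloud-core",
    "pandas-gbq",
    "pandas",
]

for name, installed in sorted(installed_versions(PACKAGES_TO_CHECK).items()):
    print("%s: %s" % (name, installed or "not installed"))
```

If the local output and the worker output differ for any of these packages, that difference is a good first candidate for the pin.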