1 vote

I am using pandas_gbq.to_gbq() to export a DataFrame to Google BigQuery. The DataFrame has a column col1 that contains NULL values.

>>> df
col1    day
apple   2019-03-01
None    2019-03-02
banana  2019-03-02
None    2019-03-03

>>> df.dtypes
col1   object
day    datetime64[ns]
dtype: object

Without defining the table schema, I am able to export the table to BigQuery successfully, with the null values preserved in col1.

from google.cloud import bigquery
import pandas as pd
import pandas_gbq

pandas_gbq.to_gbq(
    df,
    table_name,
    project_id='project-dev',
    chunksize=None,
    if_exists='replace',
)

The default table schema created in BigQuery:

col1   STRING      NULLABLE
day    TIMESTAMP   NULLABLE

However, when I try to define day as the DATE type in BigQuery, since I don't want TIMESTAMP, I encounter the error below (I've tried both NaN and None in col1; both produce the same error).

table_schema = [{'name': 'day', 'type': 'DATE'}]

pandas_gbq.to_gbq(
    df,
    table_name,
    project_id='project-dev',
    chunksize=None,
    if_exists='replace',
    table_schema=table_schema,
)

Error message:

    ,table_schema=table_schema
  File "/Users/xxx/anaconda3/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 1224, in to_gbq
    progress_bar=progress_bar,
  File "/Users/xxx/anaconda3/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 606, in load_data
    self.process_http_error(ex)
  File "/Users/xxx/anaconda3/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 425, in process_http_error
    raise GenericGBQException("Reason: {0}".format(ex))
pandas_gbq.gbq.GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.

I've read the documentation of pandas_gbq, but I am still not able to figure it out.

https://pandas-gbq.readthedocs.io/en/latest/api.html#pandas_gbq.to_gbq

Would someone be able to point me in the right direction? Thanks.

Have you tried defining the table schema for all the columns? – Sabri Karagönen
@WTK, according to the documentation, if you provide a string in a canonical DATE format, it will be read as DATE; here is the link. I have also run some tests in a notebook with dummy data, and it works well. I left the date field in the "YYYY-MM-DD" format and the string field with None values, and it worked. The BigQuery schema in the UI was DATE and STRING. I can share my test with you. – Alexandre Moraes
@Sab Yes, I did try defining it for all the columns, but I got the same error. – WTK
@AlexandreMoraes Thank you for sharing that link. I changed the dtype of day to string using df['day'].dt.strftime('%Y-%m-%d') and then defined the table schema as above, and it works! – WTK
@WTK, I am glad to know it worked. I made an answer out of my comment to further help the community. I would appreciate it if you could accept and upvote it. – Alexandre Moraes

1 Answer

0 votes

I am writing this answer based on the suggestion I provided in the comment section.

According to the documentation, if you provide a string in the canonical DATE format, it will be read as DATE in BigQuery. The canonical format is YYYY-[M]M-[D]D, where:

  • YYYY: Four-digit year
  • [M]M: One- or two-digit month
  • [D]D: One- or two-digit day

Therefore, after changing the column's type and format as described above, you will be able to define your schema as desired, or BigQuery will infer the field as DATE on its own.
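In the question's case, where day is a datetime64[ns] column, a minimal sketch of that conversion could look like the following (the destination table and project names here are placeholders):

import pandas as pd
import pandas_gbq

df = pd.DataFrame(
    {
        "col1": ["apple", None, "banana", None],
        "day": pd.to_datetime(["2019-03-01", "2019-03-02", "2019-03-02", "2019-03-03"]),
    }
)

# Convert datetime64[ns] to strings in the canonical DATE format YYYY-MM-DD
df["day"] = df["day"].dt.strftime("%Y-%m-%d")

table_schema = [
    {"name": "col1", "type": "STRING"},
    {"name": "day", "type": "DATE"},
]

pandas_gbq.to_gbq(
    df,
    "dataset.table_name",  # placeholder destination
    project_id="project-dev",
    if_exists="replace",
    table_schema=table_schema,
)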

As I mentioned in the comments, I ran some tests to confirm and exemplify this suggestion, and I will share the code to further help the community. I used a Jupyter Notebook on AI Platform to run the sample below:

!pip install pandas_gbq

import pandas as pd

# Schema for the destination table: the date column as DATE, the nullable column as STRING
table_schema = [
    {'name': 'my_datetime', 'type': 'DATE'},
    {'name': 'my_string', 'type': 'STRING'},
]

df = pd.DataFrame(
    {
        # Dates as strings in the canonical "YYYY-MM-DD" format
        "my_datetime": ["2020-01-01", "2020-01-01", "2020-01-01"],
        # A string column containing None values
        "my_string": ['a1', None, 'a3'],
    }
)

df.to_gbq(
    destination_table='data_frame.data_set',
    project_id='project_id',
    if_exists='replace',
    table_schema=table_schema,
)
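If you want to confirm the resulting schema programmatically rather than in the BigQuery UI, one way is to fetch the table with the google-cloud-bigquery client (reusing the placeholder project and table names from above):

from google.cloud import bigquery

client = bigquery.Client(project="project_id")
table = client.get_table("project_id.data_frame.data_set")

# Print each column with its BigQuery type; expected output: my_datetime DATE, my_string STRING
for field in table.schema:
    print(field.name, field.field_type)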

I hope it helps.