I am trying to tidy up a large (8 GB) CSV file in Python and then stream it into BigQuery. The code below starts off fine: the table is created and the first 1,000 rows go in, but then I get the error:

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

Is this perhaps related to the streaming buffer? A further issue is that I have to delete the table before running the code again, otherwise the first 1,000 entries are duplicated because of the 'append' method (a possible workaround is sketched after the code below).

import pandas as pd

destination_table = 'product_data.FS_orders'
project_id = '##'
pkey = '##'

chunks = []

for chunk in pd.read_csv('Historic_orders.csv', chunksize=1000, encoding='windows-1252',
                         names=['Orderdate', 'Weborderno', 'Productcode', 'Quantitysold',
                                'Paymentmethod', 'ProductGender', 'DeviceType', 'Brand',
                                'ProductDescription', 'OrderType', 'ProductCategory',
                                'UnitpriceGBP', 'Webtype1', 'CostPrice', 'Webtype2',
                                'Webtype3', 'Variant', 'Orderlinetax']):
    # replace() returns a new DataFrame, so assign the result back
    chunk = chunk.replace(r' *!', 'Null', regex=True)
    chunk.to_gbq(destination_table, project_id, if_exists='append', private_key=pkey)
    chunks.append(chunk)

df = pd.concat(chunks, axis=0)

print(df.head(5))

# to_csv is a DataFrame method, not a top-level pandas function
df.to_csv('Historic_orders_cleaned.csv')
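A workaround I'm considering (an untested sketch reusing the variables above) is to replace the table on the first chunk and append on the rest, so re-running doesn't duplicate rows:

for i, chunk in enumerate(pd.read_csv('Historic_orders.csv', chunksize=1000,
                                      encoding='windows-1252')):
    # the first chunk recreates the table; later chunks append to it
    chunk.to_gbq(destination_table, project_id,
                 if_exists='replace' if i == 0 else 'append',
                 private_key=pkey)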
It is possible that your CSV file has some invalid characters, such as non-UTF-8 data. - searain
Why are you streaming it? A load job would work better with an existing CSV. - Felipe Hoffa
@FelipeHoffa Do you know if it's possible to batch with to_gbq? How can I change this? - Ben P

1 Answer


Question: why streaming and not simply loading? With a load job you can upload batches of 1 GB instead of 1,000 rows at a time. Streaming is for the case where you have continuous data that needs to be appended as it arrives; if there is a break of a day between collecting the data and loading it, it's usually safer to just load it. See here.
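For illustration, a minimal batch-load sketch with the google-cloud-bigquery client (the project id and table path are placeholders; note that BigQuery's CSV loader accepts UTF-8 or ISO-8859-1, so a windows-1252 file may need re-encoding first):

from google.cloud import bigquery

client = bigquery.Client(project='your-project-id')  # placeholder project id

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True  # or supply an explicit schema
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # safe to re-run

with open('Historic_orders.csv', 'rb') as f:
    job = client.load_table_from_file(
        f, 'your-project-id.product_data.FS_orders', job_config=job_config)
job.result()  # waits until the load job completes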

Apart from that, I've had my share of issues loading tables into BigQuery from CSV files, and most of the time it was either 1) the encoding (I see you have non-UTF-8 encoding) or 2) invalid characters, e.g. a stray comma lost in the middle of the file that broke a line.
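To check for both at once, you could re-encode the file to UTF-8 and flag any line whose field count is off, roughly like this (a sketch; 18 is the expected column count from the names= list in the question):

import csv

with open('Historic_orders.csv', encoding='windows-1252', newline='') as src, \
     open('Historic_orders_utf8.csv', 'w', encoding='utf-8', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for i, row in enumerate(reader, start=1):
        if len(row) != 18:  # a stray comma or broken line shows up here
            print('line %d has %d fields' % (i, len(row)))
            continue
        writer.writerow(row)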

To validate that, what if you insert the rows backwards? Do you get the same error?
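For example, something like this sketch (it reuses the variables from the question and writes to a throwaway table; it also holds every chunk in memory, so try it on a sample of the file first):

chunks = list(pd.read_csv('Historic_orders.csv', chunksize=1000,
                          encoding='windows-1252'))
for chunk in reversed(chunks):
    # if a specific bad row is at fault, the error should now surface on a different chunk
    chunk.to_gbq('product_data.FS_orders_test', project_id,  # throwaway test table
                 if_exists='append', private_key=pkey)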