3
votes

I have 6,500 CSV files with ~250 different schemas. The files are from the FDIC (US bank regulator) dataset and have been uploaded to a Google Cloud Storage bucket.

Each financial quarter has ~250 different CSVs, and each CSV within a quarter has a different schema.

There are ~250 unique schemas, and the schemas repeat each financial quarter. The CSV files go back 100 financial quarters, to 1992.

Multiple CSVs with the same schema can be loaded using a wildcard, e.g. gs://path/*.csv. However, the table name is not auto-generated from the file name; the UI requires a table name as an input.
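For reference, the equivalent bq command for a single wildcard load into one manually named table would be something like the following, where the bucket path, dataset and table names are placeholders:

# loads all matching same-schema CSVs into a single, manually named table
bq load --autodetect --source_format=CSV mydataset.one_table "gs://mybucket/path/*.csv"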

How does one load multiple CSV files with different schemas into BigQuery?

1
You'd have to use the bq command-line tool to achieve that. - khan

1 Answer

4
votes

The way I would automate this is to read all the files from a given bucket (or one of its subfolders) and, making an assumption, use each file's name as the target table name to ingest into. Here is how:

gsutil ls gs://mybucket/subfolder/*.csv | awk '{n=split($1,A,"/"); q=split(A[n],B,"."); print "mydataset."B[1]" "$0}' | xargs -I{} sh -c 'bq --location=US load --replace=false --autodetect --source_format=CSV {}'

Make sure to replace the location and mydataset with your desired values. Also, please take note of the following assumptions:

  • The first row of each CSV is assumed to be the header and is treated as the column names.
  • We load with the --replace=false flag, meaning data is appended every time you run the command. If you want to overwrite instead, set it to true and every table's data will be overwritten on each run.
  • The CSV filename (the part before .csv) is used as the table name. You can modify the awk script to derive the name differently.
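If the one-liner is hard to follow, here is roughly the same logic written out as a plain bash loop. It is only a sketch: gs://mybucket/subfolder, mydataset and the US location are placeholders to replace with your own values.

#!/usr/bin/env bash
# Sketch: load every CSV in the bucket into a table named after the file.
for uri in $(gsutil ls gs://mybucket/subfolder/*.csv); do
  file="${uri##*/}"       # strip the path, e.g. "somefile.csv"
  table="${file%.csv}"    # strip the extension -> table name
  bq --location=US load \
     --replace=false \
     --autodetect \
     --source_format=CSV \
     "mydataset.${table}" "${uri}"
done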