0
votes

I have a bunch of CSV files with bad column names, like "A.B/C", that I copied into a GCP bucket and am trying to load into BQ from the console itself (I cannot change the column names in the source files). When I create the table by loading the first CSV, BQ renames the column to "A_B_C", which is fine, but when I try to append the 2nd CSV file to the table, it throws the error "Could not parse '2019/08/14' as DATE for field A_B_C (position 0) starting at location 77". A_B_C is the first column in the CSV, which is why the error refers to the date. IMO it has nothing to do with the date. The date format in the CSV is YYYY-MM-DD, which is in line with the BQ requirements. I even changed the schema to make A_B_C a STRING so that, if there were any issue with the date column, it would be resolved, but the error stays the same.

I also skip the 1st row when loading the 2nd CSV so that it does not trip over the column headers, but still no luck.

Any suggestions?

PS - using * or ? wildcards to load multiple CSV files at once always fails for the same obvious reason; I did not mention it in the question to avoid further confusion.

1
BQ import does not handle every special case that can occur in data. For problems like this I recommend using Dataprep to wrangle the data and writing some recipes to parse it. Be careful in the case of big data (hundreds of GBs): it can increase your spending by tens of dollars. - Jozef Cechovsky
Just for clarification: the 1st file is uploaded correctly with the DATE format in the first column (it just changes the column name), right? Also, could you share the code you are using and some sample data so I can replicate the case? - Alexandre Moraes

1 Answer

1
votes

Although you did not share the script you are using, nor the language, I was able to reproduce your case by loading more than one .csv file and appending them to the same table with a Python script. I used the same column name and format type you mentioned in the question.

In order to achieve what I described above, I used the Blob objects from google.cloud.storage so the script could go through all the .csv files with the same prefix. I also set job_config.autodetect = True and job_config.skip_leading_rows = 1 to autodetect the schema and skip the first row, which is the header. Below is the script:

from google.cloud import bigquery
from google.cloud.storage import Blob
from google.cloud import storage

# BigQuery client and target dataset
client = bigquery.Client()
dataset_id = 'dataset_name'
dataset_ref = client.dataset(dataset_id)

# Storage client and the bucket where the .csv files are stored
client_bucket = storage.Client()
bucket = client_bucket.bucket('bucket_name')

job_config = bigquery.LoadJobConfig()
# autodetect the schema from the data
job_config.autodetect = True
# skip the first row (the header)
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
# append each load to the table instead of overwriting it
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND


# it will loop over all the .csv files with the same prefix (test_),
# loading them into the table table_output
for blob in bucket.list_blobs(prefix='test_'):
    file_name = blob.name
    # uri location for the .csv file which will be uploaded
    uri = "gs://bucket_name/" + file_name
    load_job = client.load_table_from_uri(
        uri, dataset_ref.table("table_output"), job_config=job_config)  # API request

    # checking the uploads
    print("Starting job {}".format(load_job.job_id))

    load_job.result()  # Waits for table load to complete.
    print("Job finished.")

    # fetch the destination table and report how many rows it holds so far
    destination_table = client.get_table(dataset_ref.table("table_output"))
    print("Loaded {} rows.".format(destination_table.num_rows))

As for the data, I used two sample .csv files:

test_1.csv

A.B/C,other_field
2020-06-09,test

test_2.csv

A.B/C,other_field
2020-06-10,test
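
In case you want to reproduce the case with the same sample files, a minimal sketch for placing them in the bucket could look like the snippet below (the bucket name and local file paths are assumptions, not something from the question):

from google.cloud import storage

# assumed bucket name and local file paths - adjust them to your environment
client_bucket = storage.Client()
bucket = client_bucket.bucket('bucket_name')

for local_file in ['test_1.csv', 'test_2.csv']:
    # creates a blob with the same name and uploads the local file into it
    bucket.blob(local_file).upload_from_filename(local_file)
    print("Uploaded {}".format(local_file))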

Therefore, the script goes through the bucket, finds all the .csv files with the prefix test_ and uploads them to the same table. Notice that I used the same format for the date field as well as the field name A.B/C. After running the script within Cloud Shell, the load was successful and the output:

(screenshot of the resulting rows in table_output)

and the schema,

(screenshot of the table schema)

As shown above, the field name A.B/C was changed to A_B_C and its type was automatically detected as DATE, without any errors.
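
If you want to double-check the result without opening the console, a small sketch like the one below prints the detected schema and the row count (it assumes the same dataset_name and table_output names used in the script above):

from google.cloud import bigquery

client = bigquery.Client()
# assumes the dataset and table names from the script above
table = client.get_table(client.dataset('dataset_name').table('table_output'))

print("Rows: {}".format(table.num_rows))
for field in table.schema:
    # prints the field name and its detected type, e.g. "A_B_C DATE"
    print("{} {}".format(field.name, field.field_type))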