
I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket, all sharing the same prefix (filename*.csv) and the same schema. Some of the CSVs, however, make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."

By moving the CSVs out of the bucket one by one, I identified the file responsible. This CSV file doesn't even have 10 lines... I found the ticket BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so that can't be the reason.

On the other hand, this CSV is the only one with content type text/csv; charset=utf-8, all the others being text/csv, application/vnd.ms-excel, or application/octet-stream. Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, its content type is automatically converted to application/vnd.ms-excel, and then, even with the trailing empty line, BigQuery can query the permanent table based on filename*.csv.

Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
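For context, the table is a permanent external table defined over the wildcarded CSVs, roughly like the sketch below using the BigQuery Python client (project, dataset, table, and bucket names are placeholders, not the actual ones):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names -- substitute the real project, dataset, table and bucket.
table_id = "my-project.my_dataset.my_table"

# External (federated) CSV configuration pointing at the wildcarded objects.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/filename*.csv"]
external_config.autodetect = True               # or supply an explicit schema
external_config.options.skip_leading_rows = 1   # if the files have a header row

# Create the permanent table backed by the Cloud Storage files.
table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table)
```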


2 Answers


Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required. Once all the rows are cleaned, you can then sink that data into BigQuery.

Dataprep is a GUI-based ETL tool and it runs a Dataflow job internally.

Do let me know if any more clarification is required.


Just to remark on the issue: the CSV file had gzip set as its encoding, which was the reason BigQuery did not interpret it as a plain CSV file.

According to the documentation, BigQuery expects CSV data to be UTF-8 encoded:

"encoding": "UTF-8"

In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
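Alternatively, something along these lines with the Cloud Storage Python client should work (the bucket and object names below are just placeholders):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")        # placeholder bucket name
blob = bucket.get_blob("filename1.csv")    # placeholder object name

# Inspect the metadata that BigQuery sees for this object.
print(blob.content_type, blob.content_encoding)

# Clear the gzip Content-Encoding and declare a plain CSV Content-Type,
# then persist the change so BigQuery reads the object as uncompressed CSV.
blob.content_encoding = None
blob.content_type = "text/csv"
blob.patch()
```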