
We're seeing BigQuery produce invalid UTF-8 errors when the "–" (en dash) character appears in pipe-delimited CSV files. The weird thing is that these characters are in files that are over a year old, have not changed, and that BigQuery has been reading just fine for many months, until a few days ago. Here's an example of one of the errors:

Christus Trinity Clinic \x96 Rheumatology is not a valid UTF-8 string

The way the string looks in the original file is like this:

Christus Trinity Clinic – Rheumatology

Does anyone know the fix for this, or whether BigQuery has changed its functionality in a way that might cause this issue? I know that I could just upload a corrected file, but in this scenario the files are not supposed to change, for auditing purposes.
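For reference, the failing byte in the error is the Windows-1252 encoding of the en dash: 0x96 maps to "–" in Windows-1252, but on its own it is never a valid byte sequence in UTF-8, which is exactly why BigQuery rejects it. A quick stdlib check (Python here, purely for illustration):

```python
# The string from the error, as the raw bytes stored on disk.
raw = b"Christus Trinity Clinic \x96 Rheumatology"

# Decoding as UTF-8 fails, matching the BigQuery error:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)

# Decoding as Windows-1252 recovers the original string:
print(raw.decode("cp1252"))
```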

How are you uploading these files to BQ? Directly from GCS? Using the Python client? Apache Beam? - Willian Fuks
These files are being stored in Google Storage and read by BigQuery there as an external table. - seth2958
If this looks like a bug, please share job ids and file locations on the bigquery issue tracker. Especially if these files have not been changed, but the behavior has. - Felipe Hoffa
We are facing exactly the same issue since Tuesday 13 August. It clearly seems like a change in the behavior of bq load for CSV files. - Jean-Baptiste
I've opened a new issue item for this here: issuetracker.google.com/issues/139511264 - seth2958

2 Answers


I had the same issue starting Aug 14. I upload the CSV with gsutil and load it into BigQuery from the command line.

Adding the encoding option to the load command fixed it for me.

Encoding:

--encoding ISO-8859-1

Command line:

bq --location=US load --skip_leading_rows=1 --encoding ISO-8859-1 --replace --source_format=CSV gcs.dim_employee

We saw the same thing suddenly start happening yesterday.
For me, the solution was to add an encoding type to the load config.
(I'm using the PHP client, but your client probably has this option too.)

$loadConfig->encoding('ISO-8859-1');
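If you're not sure which encoding flag to pass, you can check a sample of the file locally before loading. A minimal sketch (Python here; the helper name is my own, and the fallback relies on the fact that ISO-8859-1 defines all 256 byte values, so it never fails to decode):

```python
def guess_csv_encoding(data: bytes) -> str:
    """Return 'UTF-8' if the bytes decode cleanly as UTF-8,
    otherwise fall back to 'ISO-8859-1', which BigQuery accepts
    for CSV loads and which decodes any byte sequence."""
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "ISO-8859-1"

# The string from the question, as it appears on disk:
print(guess_csv_encoding(b"Christus Trinity Clinic \x96 Rheumatology"))
```

Run it over the first few megabytes of the file and pass the result to your client's encoding option.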