0
votes

I'm trying to load an ISO-8859-1 file into BigQuery using Dataflow. I've built a template with the Apache Beam Java SDK. Everything works, but when I check the content of the BigQuery table I see that some characters like 'ñ' or accented vowels ('á', 'é', etc.) haven't been stored properly; they have been stored as �.

I've tried several charset conversions before writing to BigQuery. I've also created a custom ISOCoder and passed it to the pipeline with setCoder(), but nothing works.
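For what it's worth, the � symptom is exactly what you get when ISO-8859-1 bytes are decoded as UTF-8 (Beam's TextIO decodes as UTF-8). A minimal plain-Java sketch reproducing it (class name is just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "año" encoded as ISO-8859-1: ñ is the single byte 0xF1.
        byte[] raw = {(byte) 0x61, (byte) 0xF1, (byte) 0x6F};

        // Decoding with the right charset preserves the accent...
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1)); // año

        // ...while decoding as UTF-8 hits a malformed byte sequence and
        // substitutes the U+FFFD replacement character: the � in BigQuery.
        System.out.println(new String(raw, StandardCharsets.UTF_8));
    }
}
```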

Does anyone know whether it is possible to load this kind of file into BigQuery using Apache Beam, or is only UTF-8 supported?

Thanks in advance for your help.

1
Just an observation from my experience. I had a similar issue with the Cyrillic letter И (the only problematic one, AFAIK) and charset conversion. I implemented a workaround: before the processing I substitute И with a unique marker, RUSSIAN_I — I need to be sure that the reverse conversion doesn't spoil the user's text. After the processing in Java I carry out this reverse conversion using a regex. - dimirsen

1 Answer

1
votes

This feature is currently not available in the Java SDK of Beam. In Python it seems to be possible by passing additional_bq_parameters to WriteToBigQuery, see: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L177
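As a workaround on the Java side, you can avoid TextIO (which assumes UTF-8), read each file as raw bytes, and do the ISO-8859-1 decode yourself; in a pipeline this would typically live in a DoFn over FileIO.ReadableFile, using readFullyAsBytes(). A plain-Java sketch of the decoding step (class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class Latin1Reader {
    // Decode raw file bytes as ISO-8859-1 and split into lines.
    // Once decoded, the resulting Java Strings serialize cleanly
    // to UTF-8 on the way into BigQuery.
    public static List<String> decodeLatin1Lines(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        return Arrays.asList(decoded.split("\r?\n"));
    }
}
```

The same approach works for any single-byte charset: decode explicitly at ingestion time and keep everything downstream in UTF-8.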