i am "playing" with apache beam/dataflow in datalab. I am trying to read a csv file from gcs. when i create the pcollection using:
lines = p | 'ReadMyFile' >> beam.io.ReadFromText('gs://' + BUCKET_NAME + '/' + input_file, coder='StrUtf8Coder')
I get the following error:
LookupError: unknown encoding: "THE","NAME","OF","COLUMNS"
It seems the column names are being interpreted as the encoding?
I do not understand what's wrong. If I do not specify the "coder", I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 1045: invalid continuation byte
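For what it's worth, my understanding is that the coder argument expects a coder instance rather than a string, so I assume the call should look more like the sketch below; since StrUtf8Coder is the default coder anyway, I expect it still runs into the same UTF-8 problem:

import apache_beam as beam

# Sketch: pass a coder instance instead of the string 'StrUtf8Coder'.
# StrUtf8Coder is the default, so this presumably still raises the same
# UnicodeDecodeError on the non-UTF-8 bytes.
lines = p | 'ReadMyFile' >> beam.io.ReadFromText(
    'gs://' + BUCKET_NAME + '/' + input_file,
    coder=beam.coders.StrUtf8Coder())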
Outside Apache Beam I am able to handle this error by reading the file from GCS:
from google.cloud import storage  # assuming the google-cloud-storage client

blob = storage.Blob(gs_path, bucket)   # gs_path: the object's path inside the bucket
data = blob.download_as_string()
text = data.decode('utf-8', 'ignore')  # drop the bytes that are not valid UTF-8
I read that Apache Beam only supports UTF-8, and the file does not contain only valid UTF-8.
Should I download the file, decode it, and then convert it to a PCollection, along the lines of the sketch below?
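A rough sketch of what I have in mind (assuming the google-cloud-storage client and beam.Create, with BUCKET_NAME and input_file as above):

import apache_beam as beam
from google.cloud import storage

# Download and decode outside Beam, dropping the invalid bytes,
# then turn the resulting lines into a PCollection with beam.Create.
client = storage.Client()
data = client.bucket(BUCKET_NAME).blob(input_file).download_as_string()
text_lines = data.decode('utf-8', 'ignore').splitlines()

with beam.Pipeline() as p:
    lines = p | 'CreateLines' >> beam.Create(text_lines)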
Any suggestions?