
I have a requirement where I need to collect some columns onto the Spark driver, and some of those columns contain non-ASCII characters. But while collecting them it gives this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 187: ordinal not in range(128).

Any idea how I could apply a UDF to the column contents while fetching them and then collect the result onto the driver? I have sketched roughly what I mean below.

I am using PySpark for this.
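
Something along these lines is what I had in mind, just a rough sketch (df and the column name text are placeholders for my actual DataFrame and column):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Re-encode each value as UTF-8 on the executors (Python 2), so the
# driver never has to decode it with the default ascii codec.
# df and 'text' are placeholder names.
to_utf8 = udf(lambda s: s.encode('utf-8') if s is not None else None, StringType())

rows = df.withColumn('text', to_utf8(df['text'])).collect()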

How do you read the data? If you read them from a file you can define the encoding as utf-8 on read, e.g. as in the snippet below. – Michail N
I am reading the data from Hive. – Avik Aggarwal
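
For a file-based source (not Hive), Michail N's suggestion could look like this; the encoding option applies to sources such as CSV, and the path is illustrative:

# Illustrative only: reading a CSV file with an explicit UTF-8 encoding.
df = spark.read.option('encoding', 'UTF-8').csv('/path/to/file.csv')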

1 Answer


I had the same problem. This worked for me:

import sys
import codecs

# Python 2: wrap stdout and stderr so that unicode output is encoded
# as UTF-8 when printed, instead of with the default ascii codec.
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
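
These wrappers only change how values are written to the console on the driver: print then encodes unicode strings as UTF-8 instead of relying on Python 2's default ascii codec.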

I found it here: https://chase-seibert.github.io/blog/2014/01/12/python-unicode-console-output.html