
I have a requirement where I need to collect some columns onto the Spark driver, and some of those columns contain non-ASCII characters. But while collecting them it gives this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 187: ordinal not in range(128).

Any idea how I could apply a UDF to the column contents while fetching them and then collect the result onto the driver? I have sketched roughly what I mean below.

I am using PySpark for this.
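
Something along these lines is what I had in mind, just a rough sketch (df and the column name text are placeholders for my actual DataFrame and column):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Re-encode each value as UTF-8 on the executors (Python 2), so the
# driver never has to decode it with the default ascii codec.
# df and 'text' are placeholder names.
to_utf8 = udf(lambda s: s.encode('utf-8') if s is not None else None, StringType())

rows = df.withColumn('text', to_utf8(df['text'])).collect()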

How do you read the data? If you read them from a file you can define the encoding as utf-8 on read, e.g. as in the snippet below. – Michail N
I am reading the data from Hive. – Avik Aggarwal
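
For a file-based source (not Hive), Michail N's suggestion could look like this; the encoding option applies to sources such as CSV, and the path is illustrative:

# Illustrative only: reading a CSV file with an explicit UTF-8 encoding.
df = spark.read.option('encoding', 'UTF-8').csv('/path/to/file.csv')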

1 Answer


I had the same problem. This worked for me:

import sys
import codecs

# Python 2: wrap stdout and stderr so that unicode output is encoded
# as UTF-8 when printed, instead of with the default ascii codec.
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
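
These wrappers only change how values are written to the console on the driver: print then encodes unicode strings as UTF-8 instead of relying on Python 2's default ascii codec.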

I found it here: https://chase-seibert.github.io/blog/2014/01/12/python-unicode-console-output.html