16
votes

I am loading a dataframe with foreign characters (åäö) into Spark using spark.read.csv with encoding='utf-8', and then trying to do a simple show().
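Roughly, the setup looks like this (the path and header option below are placeholders, not my exact code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# path and header flag are placeholders; the string columns contain åäö
df = spark.read.csv("/path/to/data.csv", encoding='utf-8', header=True)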

>>> df.show()

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
print(self._jdf.showString(n, truncate))
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 579: ordinal not in range(128)

I figure this is probably related to Python itself, but I cannot understand how any of the tricks mentioned here, for example, can be applied in the context of PySpark and the show() function.

Do you experience this only when using show? – zero323
@zero323 Are there any other print-related commands that I could try? – salient
For starters you can check whether df.rdd.map(lambda x: x).count() succeeds. – zero323
@zero323 Yes, I have even successfully run some Spark SQL queries; it's only this show() function that fails on the encoding of the characters in the strings. – salient
So rdd.take(20), for example, executes without a problem? If so, the problem may be the header. Either way, can you provide a minimal data sample that can be used to reproduce the problem? – zero323

3 Answers

28
votes

https://issues.apache.org/jira/browse/SPARK-11772 talks about this issue, and the solution given there is to run:

export PYTHONIOENCODING=utf8

before running pyspark. I wonder why the above works, because sys.getdefaultencoding() returned utf-8 for me even without it.
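My best guess is that PYTHONIOENCODING sets the encoding of sys.stdout (which is what print uses), while sys.getdefaultencoding() only covers implicit string conversions, so the two can differ. A quick way to inspect both:

import sys

print(sys.getdefaultencoding())  # default for implicit conversions; typically 'ascii' on Python 2, 'utf-8' on Python 3
print(sys.stdout.encoding)       # what print uses; reflects PYTHONIOENCODING when it is set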

The question "How to set sys.stdout encoding in Python 3?" also talks about this and gives the following solution for Python 3:

import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
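On Python 3.7+ the same effect can, if I am not mistaken, be achieved without replacing the stream object:

import sys

# Python 3.7+ only: reconfigure the existing stdout stream in place
sys.stdout.reconfigure(encoding='utf-8')
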
5
votes
import sys
# Python 2 only: reload(sys) re-exposes setdefaultencoding, which is hidden at interpreter startup
# (sys.setdefaultencoding no longer exists in Python 3)
reload(sys)
sys.setdefaultencoding('utf-8')

This works for me: I set the encoding up front and it stays in effect for the rest of the script.

0
votes

I faced the same issue with the following versions of Spark and Python:

SPARK - 2.4.0

Python - 2.7.5

None of the above solutions worked for me.

For me, the issue was happening while trying to save the result RDD to an HDFS location. I was reading the input from HDFS and saving the result back to HDFS. This was the code used for the read and write operations when the issue came up:

Reading input data:

# read raw text and split each line on the \x01 delimiter
monthly_input = sc.textFile(monthly_input_location).map(lambda i: i.split("\x01"))
monthly_input_df = sqlContext.createDataFrame(monthly_input, monthly_input_schema)

Writing to HDFS:

# join each row's fields back together with \x01; note that str(i) implicitly
# encodes unicode values as ASCII on Python 2, which is likely where the error came from
result = output_df.rdd.map(tuple).map(lambda line: "\x01".join([str(i) for i in line]))
result.saveAsTextFile(output_location)

I changed the reading and writing code, respectively, to the following:

Reading code:

monthly_input = sqlContext.read.format("csv") \
    .option('encoding', 'UTF-8') \
    .option("header", "true") \
    .option("delimiter", "\x01") \
    .schema(monthly_input_schema) \
    .load(monthly_input_location)

Writing Code:

output_df.write.format('csv') \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .save(output_location)

Not only did this solve the issue, it also improved the I/O performance by a great deal (almost 3x).

There is one known issue with the write logic above for which I have yet to figure out a proper solution: if there is a blank field in the output, the CSV encoding writes the blank value enclosed in double quotes ("").

For me that issue is currently not a big deal: I am loading the output into Hive anyway, and the double quotes can be stripped there during the import itself.
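That said, the Spark 2.4 CSV writer appears to have an emptyValue option that controls how empty strings are written, so something like the following might avoid the quoted blanks; I have not verified it myself:

output_df.write.format('csv') \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .option("emptyValue", "") \
    .save(output_location)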

PS: I am still using SQLContext and have yet to upgrade to SparkSession, but from what I have tried so far, the equivalent read and write operations in SparkSession-based code work the same way.
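For example, I would expect the SparkSession equivalent of the read and write above to look roughly like this (same options, only the entry point changes):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

monthly_input_df = spark.read.format("csv") \
    .option('encoding', 'UTF-8') \
    .option("header", "true") \
    .option("delimiter", "\x01") \
    .schema(monthly_input_schema) \
    .load(monthly_input_location)

output_df.write.format('csv') \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .save(output_location)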