We have a DataFrame that contains binary columns. When we save it as CSV, the binary columns break the CSV parser.
Is there any way to force Spark's CSV writer to write binary columns out as hex- or base64-encoded strings?
You can check the column types, and if the type is binary, you can cast it to a hexadecimal string:
import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType
df_out = df.select([
    # hex() renders binary columns as hexadecimal strings; other columns pass through unchanged
    F.hex(c.name).alias(c.name)
    if isinstance(c.dataType, BinaryType)
    else F.col(c.name)
    for c in df.schema
])
df_out.write.csv('output', header=True)
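If you later need the original bytes back, you can decode the hex string when reading the CSV. This is a minimal sketch, assuming the binary column is named payload (a placeholder name) and the data was written as above; unhex is the inverse of hex:

import pyspark.sql.functions as F

# Read the CSV back; the former binary column comes in as a hex string
df_back = spark.read.csv('output', header=True)
# unhex() decodes the hexadecimal string back into the original binary value
df_restored = df_back.withColumn('payload', F.unhex('payload'))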
You can check df.dtypes to see whether a column's type is BinaryType and, if so, convert it to a base64 string. In Scala it looks like this:
import org.apache.spark.sql.functions.{base64, col}

// dtypes returns (columnName, typeString) pairs; binary columns are reported as "BinaryType"
val castedCols = df.dtypes.map { case (c, t) =>
  if (t == "BinaryType") base64(col(c)).as(c) else col(c)
}
val df1 = df.select(castedCols: _*)
df1.write.csv(outputPath)
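For reference, the same idea in PySpark would look roughly like the sketch below; output_path is a placeholder, and note that df.dtypes in PySpark reports binary columns with the simple type string 'binary' rather than 'BinaryType':

import pyspark.sql.functions as F

# base64() encodes binary columns as base64 strings; other columns pass through unchanged
casted_cols = [
    F.base64(F.col(c)).alias(c) if t == 'binary' else F.col(c)
    for c, t in df.dtypes
]
df1 = df.select(*casted_cols)
df1.write.csv(output_path, header=True)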