We have a DataFrame that contains binary columns. When we save it as CSV, the binary columns break the CSV parser.
Is there any way to force Spark's CSV writer to write binary columns out as hex- or base64-encoded strings?
You can check the column types, and if the type is binary, you can cast it to a hexadecimal string:
import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType
df_out = df.select([
    # hex() renders binary columns as hexadecimal strings; other columns pass through unchanged
    F.hex(c.name).alias(c.name)
    if isinstance(c.dataType, BinaryType)
    else F.col(c.name)
    for c in df.schema
])
df_out.write.csv('output', header=True)
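If you later need the original bytes back, you can decode the hex string when reading the CSV. This is a minimal sketch, assuming the binary column is named payload (a placeholder name) and the data was written as above; unhex is the inverse of hex:

import pyspark.sql.functions as F

# Read the CSV back; the former binary column comes in as a hex string
df_back = spark.read.csv('output', header=True)
# unhex() decodes the hexadecimal string back into the original binary value
df_restored = df_back.withColumn('payload', F.unhex('payload'))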
You can check df.dtypes to see whether a column's type is BinaryType and, if so, convert it to a base64 string. In Scala it looks like this:
import org.apache.spark.sql.functions.{base64, col}

// dtypes returns (columnName, typeString) pairs; binary columns are reported as "BinaryType"
val castedCols = df.dtypes.map { case (c, t) =>
  if (t == "BinaryType") base64(col(c)).as(c) else col(c)
}
val df1 = df.select(castedCols: _*)
df1.write.csv(outputPath)
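For reference, the same idea in PySpark would look roughly like the sketch below; output_path is a placeholder, and note that df.dtypes in PySpark reports binary columns with the simple type string 'binary' rather than 'BinaryType':

import pyspark.sql.functions as F

# base64() encodes binary columns as base64 strings; other columns pass through unchanged
casted_cols = [
    F.base64(F.col(c)).alias(c) if t == 'binary' else F.col(c)
    for c, t in df.dtypes
]
df1 = df.select(*casted_cols)
df1.write.csv(output_path, header=True)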