
We have a DataFrame that contains binary columns. When we save the DataFrame as CSV, the binary columns cause issues for the CSV parser.

Is there any way to force Spark's CSV writer to write out binary columns as hex- or base64-encoded strings?
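
For reference, a minimal sketch of the kind of setup involved, assuming PySpark (the column names here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A row whose binary payload contains a comma and a newline; written
# unencoded, such bytes confuse CSV parsers (newer Spark versions may
# refuse to write a BinaryType column to CSV outright)
df = spark.createDataFrame(
    [(1, bytearray(b"\x00\x01,\n\x02"))],
    ["id", "payload"],  # "payload" is inferred as BinaryType
)

df.write.csv("raw_output", header=True)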


2 Answers


You can check the column types in the schema and, if a column is binary, encode it as a hexadecimal string:

import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType

# Hex-encode every BinaryType column; leave the other columns unchanged
df_out = df.select([
    F.hex(f.name).alias(f.name)
    if isinstance(f.dataType, BinaryType)
    else F.col(f.name)
    for f in df.schema
])

df_out.write.csv('output', header=True)
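
If the original bytes are needed again later, unhex reverses the encoding when reading the CSV back (the column name below is an assumed example):

df_back = spark.read.csv('output', header=True)
# unhex interprets each pair of characters as a hexadecimal byte,
# restoring the original binary value
df_restored = df_back.withColumn('payload', F.unhex('payload'))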

Alternatively, you can check df.dtypes to see whether a column's type equals BinaryType and, if so, convert it to a base64 string. In Scala you can write it like this:

import org.apache.spark.sql.functions.{base64, col}

// Base64-encode BinaryType columns; pass the other columns through unchanged
val castedCols = df.dtypes.map { case (c, t) =>
  if (t == "BinaryType") base64(col(c)).as(c) else col(c)
}

val df1 = df.select(castedCols: _*)

df1.write.csv(outputPath)
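
Note that base64 yields a more compact encoding than hex (4 characters per 3 bytes instead of 2 per byte), and unbase64(col(c)) restores the original bytes when reading the data back.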