
One of my DataFrames (spark.sql) has this schema:

root
 |-- ValueA: string (nullable = true)
 |-- ValueB: struct (nullable = true)
 |    |-- abc: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- a0: string (nullable = true)
 |    |    |    |-- a1: string (nullable = true)
 |    |    |    |-- a2: string (nullable = true)
 |    |    |    |-- a3: string (nullable = true)
 |-- ValueC: struct (nullable = true)
 |    |-- pqr: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- info1: string (nullable = true)
 |    |    |    |-- info2: struct (nullable = true)
 |    |    |    |    |-- x1: long (nullable = true)
 |    |    |    |    |-- x2: long (nullable = true)
 |    |    |    |    |-- x3: string (nullable = true)
 |    |    |    |-- info3: string (nullable = true)
 |    |    |    |-- info4: string (nullable = true)
 |-- Value4: struct (nullable = true)
 |    |-- xyz: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- b0: string (nullable = true)
 |    |    |    |-- b2: string (nullable = true)
 |    |    |    |-- b3: string (nullable = true)
 |-- Value5: string (nullable = true)

I need to save this to a CSV file, but without using any flatten or explode, in the format below:

 |-- ValueA: string (nullable = true)
 |-- ValueB: struct (nullable = true)
 |-- ValueC: struct (nullable = true)
 |-- ValueD: struct (nullable = true)
 |-- ValueE: string (nullable = true)

I have Directly used the command [df.to_pandas().to_csv("output.csv")] this serves my purpose, but I need a better approach. I am using pyspark
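For context, here is a minimal sketch of what I am running today (assuming df is a regular spark.sql DataFrame):

# Current approach: collect the whole Spark DataFrame to the driver as a
# pandas DataFrame, then let pandas serialize the struct columns to CSV.
pdf = df.toPandas()                    # pulls all rows into driver memory
pdf.to_csv("output.csv", index=False)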

1 Answer


Spark's CSV writer doesn't support struct/array (or other complex) types yet.

Write as a Parquet file:

The better approach in Spark would be to write in Parquet format, as Parquet supports all nested data types and provides better performance when reading and writing.

df.write.parquet("<path>")

Write as a JSON file:

If writing in JSON format is acceptable, then:

df.write.json("path")
# or
df.toJSON().saveAsTextFile("path")

Write as a CSV file:

Use the to_json function, which converts struct/array columns to JSON strings, and then store in CSV format.

df.selectExpr("valueA","to_json(ValueB)"..etc).write.csv("path")