I have a PySpark DataFrame whose schema looks like this:
root
|-- company: struct (nullable = true)
| |-- 0: string (nullable = true)
| |-- 1: string (nullable = true)
| |-- 10: string (nullable = true)
| |-- 100: string (nullable = true)
| |-- 101: string (nullable = true)
| |-- 102: string (nullable = true)
| |-- 103: string (nullable = true)
| |-- 104: string (nullable = true)
| |-- 105: string (nullable = true)
| |-- 106: string (nullable = true)
| |-- 107: string (nullable = true)
| |-- 108: string (nullable = true)
| |-- 109: string (nullable = true)
I want the final format of this DataFrame to look like this:
id name
0 "foo"
1 "laa"
10 "bar"
100 "gee"
101 "yoo"
102 "koo"
instead of
0 1 10 100 101 102
"foo" "laa" "bar" "gee" "yoo" "koo"
which is what I get using `col.*` expansion.
I found an answer in this link: How to explode StructType to rows from json dataframe in Spark rather than to columns. However, that answer is in Scala Spark, not PySpark, and I am not familiar enough with the map/reduce concepts it uses to port the script to PySpark myself.
Below is a sample DataFrame with a similar schema and structure:
from pyspark.sql import *

# Two Rows of plain strings: one of employee names, one of salaries
Employee = Row('employee1', 'employee2', 'employee3', 'employee4', 'employee5')
Salaries = Row('100000', '120000', '140000', '160000', '160000')

# Wrap both Rows in a single array column named "employees"
departmentWithEmployees1 = Row(employees=[Employee, Salaries])
departmentsWithEmployees_Seq = [departmentWithEmployees1]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)
dframe.show()
The schema of this DataFrame looks like this:
root
|-- employees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
| | |-- _3: string (nullable = true)
| | |-- _4: string (nullable = true)
| | |-- _5: string (nullable = true)
This is how I want my final DataFrame to look:
Firstname Salary
employee1 100000
employee2 120000