I am trying to implement a custom explode in PySpark. Each row of my DataFrame has 4 columns that are arrays of structs with virtually the same schema (the structs in one column contain one field fewer than those in the other three). The columns are students, teaching_assistants, teachers and administrators.
The structs in students, teaching_assistants and teachers all share the same fields ("id", "student_level", "name"), while the administrators structs have only the "id" and "name" fields and are missing student_level. For example, here is a sample row in the DataFrame.
I want to perform a custom explode such that for every row I get one output row per student, teaching assistant, teacher and administrator, along with the original column name as a type field so I can search by "person type". For the sample row above, the output would be 8 rows:
+-----------+---------------------+----+---------------+----------+
| School_id | type                | id | student_level | name     |
+-----------+---------------------+----+---------------+----------+
| 1999      | students            | 1  | 0             | Brian    |
| 1999      | students            | 9  | 2             | Max      |
| 1999      | teaching_assistants | 19 | 0             | Xander   |
| 1999      | teachers            | 21 | 0             | Charlene |
| 1999      | teachers            | 12 | 2             | Rob      |
| 1999      | administrators      | 23 | null          | Marsha   |
| 1999      | administrators      | 11 | null          | Ryan     |
| 1999      | administrators      | 14 | null          | Bob      |
+-----------+---------------------+----+---------------+----------+
For the administrators, the student_level column would just be null. The problem is that if I use the explode function on each of the 4 columns separately, the exploded entries end up in different columns rather than stacked into a single set of rows.
Is it possible to accomplish this in PySpark? One thought I had was to combine the 4 array columns into a single array and then explode that array, although I am not sure whether combining arrays of structs while capturing the source column name as a field is feasible (I've tried various things), and I also don't know whether it would work given that the administrators structs are missing a field.
In the past, I have done this by converting the DataFrame to an RDD and using flatMap with a custom function, but that was very inefficient for millions of rows.