I'm trying to do a conditional explode in Spark Structured Streaming.
For instance, my streaming dataframe looks like follows (totally making the data up here). I want to explode the employees array into separate rows of arrays when contingent = 1
. When contingent = 0
, I need to let the array be as is.
|----------------|---------------------|------------------|
| Dept ID | Employees | Contingent |
|----------------|---------------------|------------------|
| 1 | ["John", "Jane"] | 1 |
|----------------|---------------------|------------------|
| 4 | ["Amy", "James"] | 0 |
|----------------|---------------------|------------------|
| 2 | ["David"] | 1 |
|----------------|---------------------|------------------|
So, my output should look like (I do not need to display the contingent
column:
|----------------|---------------------|
| Dept ID | Employees |
|----------------|---------------------|
| 1 | ["John"] |
|----------------|---------------------|
| 1 | ["Jane"] |
|----------------|---------------------|
| 4 | ["Amy", "James"] |
|----------------|---------------------|
| 2 | ["David"] |
|----------------|---------------------|
There are a couple challenges I'm currently facing:
- Exploding Arrays conditionally
- exploding arrays into arrays (rather than strings in this case)
In Hive, there was a concept of UDTF (user-defined table functions) that would allow me to do this. Wondering if there is anything comparable to it?