PySpark DataFrame - Filter nested column

Question

I know there are a lot of similar questions out there but I haven't found any that matches my scenario exactly so please don't be too trigger-happy with the Duplicate flag. I'm working in a Python 3 notebook in Azure Databricks with Spark 3.0.1.

I have the following DataFrame

+---+---------+--------+
|ID |FirstName|LastName|
+---+---------+--------+
|1  |John     |Doe     |
|2  |Michael  |        |
|3  |Angela   |Merkel  |
+---+---------+--------+

Which can be created with this code

from pyspark.sql.types import StructType,StructField, StringType, IntegerType
import pyspark.sql.functions as F

data2 = [(1,"John","Doe"),
    (2,"Michael",""),
    (3,"Angela","Merkel")
  ]

schema = StructType([ \
    StructField("ID",IntegerType(),True), \
    StructField("FirstName",StringType(),True), \
    StructField("LastName",StringType(),True), \
  ])
 
df1 = spark.createDataFrame(data=data2,schema=schema)
df1.printSchema()
df1.show(truncate=False)

I transform it into this DataFrame

+---+-----------------------------------------+
|ID |Names                                    |
+---+-----------------------------------------+
|1  |[[FirstName, John], [LastName, Doe]]     |
|2  |[[FirstName, Michael], [LastName, ]]     |
|3  |[[FirstName, Angela], [LastName, Merkel]]|
+---+-----------------------------------------+

Using this code

df2 = df1.select(
            'ID', 
            F.array(
                F.struct(
                    F.lit('FirstName').alias('NameType'), 
                    F.col('FirstName').alias('Name')
                ), 
                F.struct(
                    F.lit('LastName').alias('NameType'), 
                    F.col('LastName').alias('Name')
                )
            ).alias('Names')
        )

df2.printSchema()
df2.show(truncate=False)

Now, I'm trying to filter out the Names where the LastName is null or is an empty string. My overall goal is to have an object that can be serialized in json where Names with an empty Name value are excluded.

Like this

[
    {
        "ID": 1,
        "Names": [
            {
                "NameType": "FirstName",
                "Name": "John"
            },
            {
                "NameType": "LastName",
                "Name": "Doe"
            }
        ]
    },
    {
        "ID": 2,
        "Names": [
            {
                "NameType": "FirstName",
                "Name": "Michael"
            }
        ]
    },
    {
        "ID": 3,
        "Names": [
            {
                "NameType": "FirstName",
                "Name": "Angela"
            },
            {
                "NameType": "LastName",
                "Name": "Merkel"
            }
        ]
    }
]

I have tried

df2 = df1.select(
            'ID', 
            F.array(
                F.struct(
                    F.lit('FirstName').alias('NameType'), 
                    F.col('FirstName').alias('Name')
                ), 
                F.struct(
                    F.lit('LastName').alias('NameType'), 
                    F.col('LastName').alias('Name')
                )
            ).filter(lambda x: x.col('LastName').isNotNull()).alias('Names')
        )

but I get the error 'Column' object is not callable.

I have also tried df2 = df2.filter(F.col('Names')['LastName']) > 0) but that gives me an invalid syntax error.

I have tried

df2 = df2.filter(lambda x: (len(x)>0), F.col('Names')['LastName'])

but that give the error TypeError: filter() takes 2 positional arguments but 3 were given.

Can someone please advise me on how to get this working?

mck mck · Accepted Answer · 2021-03-16T18:24:51

You can use the higher-order function filter:

import pyspark.sql.functions as F

df3 = df2.withColumn(
    'Names', 
    F.expr("filter(Names, x -> case when x.NameType = 'LastName' and length(x.Name) = 0 then false else true end)")
)

df3.show(truncate=False)
+---+-----------------------------------------+
|ID |Names                                    |
+---+-----------------------------------------+
|1  |[[FirstName, John], [LastName, Doe]]     |
|2  |[[FirstName, Michael]]                   |
|3  |[[FirstName, Angela], [LastName, Merkel]]|
+---+-----------------------------------------+

PySpark DataFrame - Filter nested column

1 Answers