I know there are a lot of similar questions out there, but I haven't found any that match my scenario exactly, so please don't be too trigger-happy with the Duplicate flag. I'm working in a Python 3 notebook in Azure Databricks with Spark 3.0.1.
I have the following DataFrame:
+---+---------+--------+
|ID |FirstName|LastName|
+---+---------+--------+
|1  |John     |Doe     |
|2  |Michael  |        |
|3  |Angela   |Merkel  |
+---+---------+--------+
It can be created with this code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as F

data2 = [
    (1, "John", "Doe"),
    (2, "Michael", ""),
    (3, "Angela", "Merkel")
]

schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("FirstName", StringType(), True),
    StructField("LastName", StringType(), True),
])

df1 = spark.createDataFrame(data=data2, schema=schema)
df1.printSchema()
df1.show(truncate=False)
I transform it into this DataFrame
+---+-----------------------------------------+
|ID |Names                                    |
+---+-----------------------------------------+
|1  |[[FirstName, John], [LastName, Doe]]     |
|2  |[[FirstName, Michael], [LastName, ]]     |
|3  |[[FirstName, Angela], [LastName, Merkel]]|
+---+-----------------------------------------+
Using this code
df2 = df1.select(
    'ID',
    F.array(
        F.struct(
            F.lit('FirstName').alias('NameType'),
            F.col('FirstName').alias('Name')
        ),
        F.struct(
            F.lit('LastName').alias('NameType'),
            F.col('LastName').alias('Name')
        )
    ).alias('Names')
)
df2.printSchema()
df2.show(truncate=False)
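For reference, df2.printSchema() reports Names as an array of structs with NameType and Name fields, roughly like this (I'm reproducing the nullability flags from memory, so they may differ slightly):

root
 |-- ID: integer (nullable = true)
 |-- Names: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- NameType: string (nullable = false)
 |    |    |-- Name: string (nullable = true)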
Now I'm trying to filter out the Names entries where the LastName is null or an empty string. My overall goal is an object that can be serialized to JSON, in which Names entries with an empty Name value are excluded, like this:
[
  {
    "ID": 1,
    "Names": [
      {
        "NameType": "FirstName",
        "Name": "John"
      },
      {
        "NameType": "LastName",
        "Name": "Doe"
      }
    ]
  },
  {
    "ID": 2,
    "Names": [
      {
        "NameType": "FirstName",
        "Name": "Michael"
      }
    ]
  },
  {
    "ID": 3,
    "Names": [
      {
        "NameType": "FirstName",
        "Name": "Angela"
      },
      {
        "NameType": "LastName",
        "Name": "Merkel"
      }
    ]
  }
]
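For what it's worth, once the filtering works I'd expect to get the JSON out with something like this sketch (the output path below is just a made-up placeholder):

# Collect the rows as JSON strings (assumes the frame is small enough to collect)
json_rows = df2.toJSON().collect()

# Or write JSON files directly; '/mnt/out/names' is only an example path
df2.write.mode('overwrite').json('/mnt/out/names')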
I have tried
df2 = df1.select(
    'ID',
    F.array(
        F.struct(
            F.lit('FirstName').alias('NameType'),
            F.col('FirstName').alias('Name')
        ),
        F.struct(
            F.lit('LastName').alias('NameType'),
            F.col('LastName').alias('Name')
        )
    ).filter(lambda x: x.col('LastName').isNotNull()).alias('Names')
)
but I get the error 'Column' object is not callable.
I have also tried df2 = df2.filter(F.col('Names')['LastName']) > 0) but that gives me an invalid syntax error.
I have tried df2 = df2.filter(lambda x: (len(x)>0), F.col('Names')['LastName']) but that gives the error TypeError: filter() takes 2 positional arguments but 3 were given.
Can someone please advise me on how to get this working?
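In case it helps frame an answer: I suspect Spark SQL's higher-order filter function might be the right direction (via F.expr, since pyspark.sql.functions has no filter helper in 3.0.1), something like the sketch below, but I'm not sure this is idiomatic or that my struct-field reference is correct:

# Sketch only: keep the array entries whose Name field is neither null nor empty
df3 = df2.withColumn(
    'Names',
    F.expr("filter(Names, x -> x.Name is not null and x.Name != '')")
)
df3.show(truncate=False)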