
I have a PySpark dataframe and would like to apply a UDF to a column with null values.

Below is my dataframe:

+----+----+
|   a|   b|
+----+----+
|null|  00|
|.Abc|null|
|/5ee|  11|
|null|   0|
+----+----+

Below is the desired dataframe (remove punctuation and convert string values in column a to upper case if the row value is not null):

+----+----+
|   a|   b|
+----+----+
|null|  00|
| ABC|null|
| 5EE|  11|
|null|   0|
+----+----+

Below is my UDF and code:

import pyspark.sql.functions as F
import re

remove_punct = F.udf(lambda x: re.sub(r'[^\w\s]', '', x))
df = df.withColumn('a', F.when(F.col("a").isNotNull(), F.upper(remove_punct(F.col("a")))))

Below is the error:

TypeError: expected string or bytes-like object

Can you please suggest what would be the optimal solution to get the desired DF?

Thanks in advance!


1 Answer


Use regexp_replace. No need for a UDF.

df = df.withColumn('a', F.upper(F.regexp_replace(F.col('a'), r'[^\w\s]', '')))
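As a sanity check, the same pattern can be exercised with plain Python's re module on the sample values from the question (a minimal sketch; regexp_replace leaves null rows untouched on its own, so no guard is needed):

```python
import re

# Same pattern passed to regexp_replace above, as a raw string
pattern = r'[^\w\s]'

# Non-null sample values from column a
samples = ['.Abc', '/5ee']
cleaned = [re.sub(pattern, '', s).upper() for s in samples]
print(cleaned)  # ['ABC', '5EE']
```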

If you insist on using a UDF, you need to handle None explicitly:

remove_punct = F.udf(lambda x: re.sub(r'[^\w\s]', '', x) if x is not None else None)
df = df.withColumn('a', F.upper(remove_punct(F.col('a'))))
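The fix works because the lambda now short-circuits on None before re.sub is ever called; re.sub raises the TypeError from the question when given a non-string. The guard can be verified outside Spark (a minimal sketch of the UDF body alone):

```python
import re

# Null-safe version of the UDF body: return None unchanged instead of
# passing it to re.sub, which only accepts strings or bytes-like objects
remove_punct = lambda x: re.sub(r'[^\w\s]', '', x) if x is not None else None

print(remove_punct(None))    # None
print(remove_punct('.Abc'))  # Abc
```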