1
votes

Trying to use simple udf inside withColumn, i'm getting the below error.

Actually the order number is a string type and it has the value something like this below and it has no null values.

1234.345.344343434

code:

order_udf = udf(lambda x: x[x.rfind(".") + 1:], StringType())
df.withColumn("order_number", order_udf('order_number')).show()

Do i need to change something on udf call?

2
I was able to execute the above code and got ordernumber as 344343434. - notNull
i too got this when i execute in my local, but in the cluster its not working.. i hope this is something to do with DataType is not getting passed correctly - Shankar
ok.. is there specific reason behind using udf instead of regular spark functions? like regexp_replace (or) split? - notNull
@Shu: no, i thought of using this is easy? can we use regexp_replace for this? basically i need to get the last part of the order number - Shankar

2 Answers

2
votes

You can do without udf:(using split and element_at)

from pyspark.sql import functions as F
list=[['1234.345.344343434']]
df=spark.createDataFrame(list,['value'])
df.show()


+------------------+
|             value|
+------------------+
|1234.345.344343434|
+------------------+


df.withColumn("value", F.element_at(F.split("value",'\.'),-1)).show()

+---------+
|value    |
+---------+
|344343434|
+---------+
2
votes

We can split the order number on . then get the last element of array (or) we can use regexp_extract function to get only the order_number from last ..

Example:

from pyspark.sql.functions import *

df1=sc.parallelize([('1234.345.344343434',)]).toDF(["o"])
df1.withColumn("arr", split(col("o"),'\.')).selectExpr("arr[size(arr)-1] as order_number").show(10,False)

#starting from spark2.4+ we can use element_at function
df1.withColumn("arr", element_at(split(col("o"),'\.'),-1)).show(10,False)

#using regexp_extract function
df1.withColumn("order_number",regexp_extract(col('o'),'(?=[^.]+$)(.*)',1)).show()

#+------------+
#|order_number|
#+------------+
#|344343434   |
#+------------+