I have a PySpark DataFrame:
number | matricule
--------------------------------------------
1 | ["AZ 1234", "1234", "00100"]
--------------------------------------------
23 | ["1010", "12987"]
--------------------------------------------
56 | ["AZ 98989", "22222", "98989"]
--------------------------------------------
The matricule array contains duplicate values once the "AZ" prefix is removed.
I would like to strip "AZ" from each value and then remove the duplicate values in the matricule array. Sometimes there is a space right after "AZ", and that should be removed as well.
I wrote a UDF:
import pyspark.sql.functions as F

def remove_AZ(A):
    for item in A:
        if item.startswith('AZ'):
            item.replace('AZ', '')

udf_remove_AZ = F.udf(remove_AZ)
df = df.withColumn("AZ_2", udf_remove_AZ(df.matricule))
I get null in every row of the AZ_2 column.
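A likely cause of the nulls: remove_AZ never returns anything, and str.replace returns a new string instead of modifying item, so the UDF always yields None. F.udf also defaults to a string return type, so an array return type has to be declared. Below is a minimal sketch of a corrected version. The names remove_az and add_cleaned_column are hypothetical, and it assumes that "AZ" only ever appears as a prefix and that first-seen order should be kept when dropping duplicates.

```python
import re


def remove_az(values):
    """Strip a leading 'AZ' (plus optional whitespace) from each item,
    then drop duplicates while keeping first-seen order."""
    if values is None:
        return None
    cleaned = [
        re.sub(r"^AZ\s*", "", v) if v is not None else None
        for v in values
    ]
    # dict.fromkeys keeps insertion order, so it de-duplicates stably.
    return list(dict.fromkeys(cleaned))


def add_cleaned_column(df, source="matricule", target="AZ_2"):
    # Imported here so remove_az can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Declare the return type explicitly; F.udf defaults to StringType.
    udf_remove_az = F.udf(remove_az, ArrayType(StringType()))
    return df.withColumn(target, udf_remove_az(F.col(source)))
```

With this sketch, remove_az(["AZ 1234", "1234", "00100"]) gives ["1234", "00100"], and the column would be added with df = add_cleaned_column(df).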
How can I remove "AZ" from each value in the matricule array and then remove the duplicates inside it?
Thank you
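If Spark 3.1 or later is available (an assumption, since no version is stated above), the same result can probably be obtained without a Python UDF by combining the built-in transform, regexp_replace and array_distinct functions. This is only a sketch: the helper name dedupe_matricule and the constant AZ_PREFIX are hypothetical, and the default column names are taken from the example above.

```python
# Regex for a leading "AZ" followed by optional whitespace.
AZ_PREFIX = r"^AZ\s*"


def dedupe_matricule(df, source="matricule", target="AZ_2"):
    # Imported here so the helper can be defined without Spark installed.
    from pyspark.sql import functions as F

    # Strip the prefix from every array element
    # (the Python F.transform API needs Spark 3.1+).
    stripped = F.transform(
        F.col(source),
        lambda x: F.regexp_replace(x, AZ_PREFIX, ""),
    )
    # array_distinct removes repeated elements from the array.
    return df.withColumn(target, F.array_distinct(stripped))
```

It would be called as df = dedupe_matricule(df). Staying with built-in column functions avoids sending every row through the Python worker, which is usually the slow part of a UDF.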