I want to check whether an array contains a string in PySpark (Spark < 2.4).
Example Dataframe:
column_1 <Array> | column_2 <String>
--------------------------------------------
["2345","98756","8794"] | 8794
--------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------
["3475","8956","45678"] | 3475
--------------------------------------------
I would like to compare the two columns, column_1 and column_2. If column_1 contains the value of column_2, that value should be removed from column_1. I wrote a UDF to subtract column_2 from column_1, but it is not working:
def contains(x, y):
    try:
        sx, sy = set(x), set(y)
        if len(sx) == 0:
            return sx
        elif len(sy) == 0:
            return sx
        else:
            return sx - sy
    # in exception, for example `x` or `y` is None (not a list)
    except:
        return sx

udf_contains = udf(contains, 'string')
new_df = my_df.withColumn('column_1', udf_contains(my_df.column_1, my_df.column_2))
Expected result:
column_1 <Array> | column_2 <String>
--------------------------------------------------
["2345","98756"] | 8794
--------------------------------------------------
["8756","45678","987563"] | 1234
--------------------------------------------------
["8956","45678"] | 3475
--------------------------------------------------
How can I do this, given that in some cases column_1 is [] and column_2 is null? Thank you.
udf_contains = udf(lambda x,y: [e for e in x if e != y], 'array<string>') – jxc
udf(lambda x,y: [e for e in x if e != y] if isinstance(x, list) else x, 'array<string>') – jxc
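
For reference, here is a minimal sketch of the same idea written out as a named function. Two things likely break the UDF in the question: `set(y)` on a string such as "8794" produces the set of its characters ({'8', '7', '9', '4'}) rather than a one-element set, and the UDF is declared with return type 'string' although it returns a collection. The names `remove_value` and `build_udf` below are hypothetical, and the choice to pass a null array or a null column_2 through unchanged is an assumption about the desired behaviour.

```python
def remove_value(values, target):
    """Return `values` without the entries equal to `target`."""
    # Assumption: a null array (column_1 is null) or a null target
    # (column_2 is null) means there is nothing to remove.
    if values is None or target is None:
        return values
    # A list comprehension keeps the original order and any duplicates,
    # which a set difference would not. An empty list stays empty.
    return [v for v in values if v != target]


def build_udf():
    """Wrap remove_value as a Spark UDF returning array<string>."""
    # Imported lazily so remove_value can be tested without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType
    return udf(remove_value, ArrayType(StringType()))


# Usage, assuming `my_df` is the DataFrame from the question:
# udf_remove = build_udf()
# new_df = my_df.withColumn('column_1', udf_remove(my_df.column_1, my_df.column_2))
```

Keeping the list logic in a plain Python function makes the null and empty cases easy to check locally before running it on the cluster.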