I am writing a UDF that takes two of the DataFrame's columns plus an extra parameter (a constant value) and adds a new column to the DataFrame. My function looks like this:
def udf_test(column1, column2, constant_var):
    if column1 == column2:
        return column1
    else:
        return constant_var
Also, I am doing the following to pass in multiple columns:
apply_test = udf(udf_test, StringType())
df = df.withColumn('new_column', apply_test('column1', 'column2'))
This does not work unless I remove constant_var as my function's third argument, but I really need it. So I have tried something like the following:
constant_var = 'TEST'
apply_test = udf(lambda x: udf_test(x, constant_var), StringType())
df = df.withColumn('new_column', apply_test(constant_var)(col('column1', 'column2')))
and
apply_test = udf(lambda x,y: udf_test(x, y, constant_var), StringType())
None of the above has worked for me. I got those ideas from this and this Stack Overflow post, and I think it is clear how my question differs from both of them. Any help would be much appreciated.
NOTE: I have simplified the function here just for the sake of discussion; the actual function is more complex. I know this particular operation could be done using when and otherwise statements.
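For reference, here is a minimal sketch of two ways the constant could be supplied to the UDF. It assumes PySpark's `udf`, `col`, and `lit` from `pyspark.sql.functions`. The helper names `with_closure` and `with_literal` are hypothetical and only used for illustration. The first variant captures the constant in a closure, so the UDF receives only the two columns. The second keeps the three-argument function and wraps the constant in `lit`, because a plain string passed to a UDF call is read as a column name.

```python
def udf_test(column1, column2, constant_var):
    # Same logic as in the question: keep the value when both columns agree.
    if column1 == column2:
        return column1
    return constant_var


def with_closure(df, constant_var):
    """Hypothetical helper: bind the constant in Python via a closure."""
    # Imported here so udf_test can be exercised without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # The lambda takes only the two column values; constant_var is captured.
    apply_test = udf(lambda x, y: udf_test(x, y, constant_var), StringType())
    return df.withColumn('new_column', apply_test(col('column1'), col('column2')))


def with_literal(df, constant_var):
    """Hypothetical helper: pass the constant as a literal column."""
    from pyspark.sql.functions import col, lit, udf
    from pyspark.sql.types import StringType

    # udf_test keeps all three arguments; lit() turns the constant into a Column.
    apply_test = udf(udf_test, StringType())
    return df.withColumn(
        'new_column',
        apply_test(col('column1'), col('column2'), lit(constant_var)),
    )
```

Either helper would be called as, for example, `df = with_closure(df, 'TEST')`, assuming `df` has string columns named `column1` and `column2`.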
.when() and .otherwise(), right? – pvy4917