I have a PySpark DataFrame, df1, that looks like:
Customer1 Customer2 v_cust1 v_cust2
1 2 0.9 0.1
1 3 0.3 0.4
1 4 0.2 0.9
2 1 0.8 0.8
I want to compute the cosine similarity between the v_cust1 and v_cust2 columns and end up with something like this:
Customer1 Customer2 v_cust1 v_cust2 cosine_sim
1 2 0.9 0.1 0.1
1 3 0.3 0.4 0.9
1 4 0.2 0.9 0.15
2 1 0.8 0.8 1
I have a Python function that takes either single numbers or arrays of numbers:
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
How can I create the cosine_sim column in my DataFrame using a UDF? Can I pass several columns, instead of just one, to a UDF that wraps cos_sim?
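A minimal sketch of one way this could look, assuming the column names v_cust1 and v_cust2 from the sample above. The helper name add_cosine_sim and the cos_sim_udf variable are hypothetical and only here for illustration. The idea is that a PySpark UDF can be called with several columns, each of which reaches the Python function as a separate positional argument:

```python
import numpy as np


def cos_sim(a, b):
    # Cosine similarity of two numbers or two equal-length sequences.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def add_cosine_sim(df):
    # Hypothetical helper: returns df with an extra cosine_sim column.
    # PySpark is imported here so cos_sim stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Wrap the plain Python function; DoubleType matches the float it returns.
    cos_sim_udf = F.udf(cos_sim, DoubleType())

    # Pass both columns to the UDF; each arrives as one positional argument.
    return df.withColumn(
        "cosine_sim", cos_sim_udf(F.col("v_cust1"), F.col("v_cust2"))
    )
```

It would be called as df1 = add_cosine_sim(df1). One caveat: when both columns hold single nonzero numbers, as in the sample, the formula reduces to the sign of their product, so every row comes out as 1.0 or -1.0. The result is only informative when the columns hold arrays, which Spark passes to the UDF as Python lists.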