This question follows up on my previous one: How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks.
I have worked out part of it, but am now stuck on another problem.
I have a small pyspark dataframe like this:
df1:
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|topic| termIndices| termWeights| terms|
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| 0| [3, 155, 108, 67, 239, 4, 72, 326, 128, 189]|[0.023463344607734377, 0.011772322769900843, 0....|[cell, apoptosis, uptake, loss, transcription, ...|
| 1| [16, 8, 161, 86, 368, 153, 18, 214, 21, 222]|[0.013057307487199429, 0.011453455929929763, 0....|[therapy, cancer, diet, lung, marker, sensitivi...|
| 2| [0, 1, 124, 29, 7, 2, 84, 299, 22, 90]|[0.03979063871841061, 0.026593954837078836, 0.0...|[group, expression, performance, use, disease, ...|
| 3| [204, 146, 74, 240, 152, 384, 55, 250, 238, 92]|[0.009305626056223443, 0.008840730657888991, 0....|[pattern, chemotherapy, mass, the amount, targe...|
It has fewer than 100 rows and is very small. Each term in "terms" has a corresponding weight in the "termWeights" column.
I have another large pyspark dataframe (50+ GB) like:
df2:
+------+--------------------------------------------------+
|r_id| tokens|
+------+--------------------------------------------------+
| 0|[The human KCNJ9, Kir, GIRK3, member, potassium...|
| 1|[BACKGROUND, the treatment, breast, cancer, the...|
| 2|[OBJECTIVE, the relationship, preoperative atri...|
For each row in df2, I need to match its tokens against the terms of every topic in df1 and find the topic with the highest sum of termWeights over the matched terms.
Finally, I need a dataframe like:
r_id    tokens    topic
where "topic" is the topic in df1 with the highest sum of termWeights over the terms matched in that row's tokens.
I have defined a UDF (based on df2), but it cannot access the columns of df1. I have thought about using a "cross join" of df1 and df2, but I do not need to join every row of df2 with every row of df1. I only need to keep all columns of df2 and add one column: the topic whose terms, matched against the row's tokens, give the highest sum of termWeights. An untested sketch of that join idea follows.
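This is only a rough sketch of the direction I mean, not working code; it assumes terms[i] lines up with termWeights[i] in df1, and that "matching" means exact string equality between a token and a term:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Explode df1 into one (topic, term, weight) row per term.
# arrays_zip pairs each term with its weight positionally
# (assumes terms[i] corresponds to termWeights[i]).
topic_terms = (
    df1.select("topic", F.explode(F.arrays_zip("terms", "termWeights")).alias("tw"))
       .select("topic",
               F.col("tw.terms").alias("term"),
               F.col("tw.termWeights").alias("weight"))
)

# Explode df2's tokens and join on exact token == term matches,
# then sum the matched weights per (r_id, topic).
scores = (
    df2.select("r_id", F.explode("tokens").alias("token"))
       .join(topic_terms, F.col("token") == F.col("term"))
       .groupBy("r_id", "topic")
       .agg(F.sum("weight").alias("score"))
)

# For each r_id, keep only the topic with the highest score
# (ties are broken arbitrarily by row_number).
w = Window.partitionBy("r_id").orderBy(F.col("score").desc())
best = (
    scores.withColumn("rank", F.row_number().over(w))
          .filter("rank = 1")
          .select("r_id", "topic")
)

# Attach the winning topic back to df2; rows whose tokens match
# no terms at all end up with a null topic under a left join.
result = df2.join(best, "r_id", "left")
```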
I am also not sure how to implement this logic with pyspark.sql.functions.udf; the broadcast-based direction I am imagining is sketched below.
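Again untested, and it assumes spark is the active SparkSession, topic is an integer, and df1 comfortably fits in driver memory (it should, at fewer than 100 rows):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Collect the small df1 to the driver and broadcast it as a plain
# Python dict {topic: {term: weight}} so the UDF can read it on
# every executor without joining.
topic_weights = {
    row["topic"]: dict(zip(row["terms"], row["termWeights"]))
    for row in df1.collect()
}
bc_weights = spark.sparkContext.broadcast(topic_weights)

@F.udf(IntegerType())
def best_topic(tokens):
    token_set = set(tokens)
    best, best_score = None, float("-inf")
    # Score each topic by summing the weights of its terms that
    # appear in this row's tokens; keep the highest-scoring topic.
    for topic, weights in bc_weights.value.items():
        score = sum(w for t, w in weights.items() if t in token_set)
        if score > best_score:
            best, best_score = topic, score
    return best

result = df2.withColumn("topic", best_topic("tokens"))
```

Is one of these two directions the right way to do this, or is there a better pattern for this kind of small-to-large matching?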