This question follows up on my previous one: How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks.
I have worked out part of it, but am now stuck on another problem.
I have a small pyspark dataframe like this:
df1:
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|topic| termIndices| termWeights| terms|
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| 0| [3, 155, 108, 67, 239, 4, 72, 326, 128, 189]|[0.023463344607734377, 0.011772322769900843, 0....|[cell, apoptosis, uptake, loss, transcription, ...|
| 1| [16, 8, 161, 86, 368, 153, 18, 214, 21, 222]|[0.013057307487199429, 0.011453455929929763, 0....|[therapy, cancer, diet, lung, marker, sensitivi...|
| 2| [0, 1, 124, 29, 7, 2, 84, 299, 22, 90]|[0.03979063871841061, 0.026593954837078836, 0.0...|[group, expression, performance, use, disease, ...|
| 3| [204, 146, 74, 240, 152, 384, 55, 250, 238, 92]|[0.009305626056223443, 0.008840730657888991, 0....|[pattern, chemotherapy, mass, the amount, targe...|
It has fewer than 100 rows and is very small. Each term in "terms" has a corresponding weight in the "termWeights" column.
I have another large pyspark dataframe (50+ GB) like:
df2:
+------+--------------------------------------------------+
|r_id| tokens|
+------+--------------------------------------------------+
| 0|[The human KCNJ9, Kir, GIRK3, member, potassium...|
| 1|[BACKGROUND, the treatment, breast, cancer, the...|
| 2|[OBJECTIVE, the relationship, preoperative atri...|
For each row in df2, I need to match its tokens against the terms of every topic in df1 and find the topic with the highest sum of termWeights over the matched terms.
Finally, I need a dataframe like:
r_id    tokens    topic
where "topic" is the topic in df1 with the highest sum of termWeights over the terms matched in that row's tokens.
I have defined a UDF (based on df2), but it cannot access the columns of df1. I have thought about using a "cross join" of df1 and df2, but I do not need to join every row of df2 with every row of df1. I only need to keep all columns of df2 and add one column: the topic whose terms, matched against the row's tokens, give the highest sum of termWeights. An untested sketch of that join idea follows.
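This is only a rough sketch of the direction I mean, not working code; it assumes terms[i] lines up with termWeights[i] in df1, and that "matching" means exact string equality between a token and a term:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Explode df1 into one (topic, term, weight) row per term.
# arrays_zip pairs each term with its weight positionally
# (assumes terms[i] corresponds to termWeights[i]).
topic_terms = (
    df1.select("topic", F.explode(F.arrays_zip("terms", "termWeights")).alias("tw"))
       .select("topic",
               F.col("tw.terms").alias("term"),
               F.col("tw.termWeights").alias("weight"))
)

# Explode df2's tokens and join on exact token == term matches,
# then sum the matched weights per (r_id, topic).
scores = (
    df2.select("r_id", F.explode("tokens").alias("token"))
       .join(topic_terms, F.col("token") == F.col("term"))
       .groupBy("r_id", "topic")
       .agg(F.sum("weight").alias("score"))
)

# For each r_id, keep only the topic with the highest score
# (ties are broken arbitrarily by row_number).
w = Window.partitionBy("r_id").orderBy(F.col("score").desc())
best = (
    scores.withColumn("rank", F.row_number().over(w))
          .filter("rank = 1")
          .select("r_id", "topic")
)

# Attach the winning topic back to df2; rows whose tokens match
# no terms at all end up with a null topic under a left join.
result = df2.join(best, "r_id", "left")
```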
I am also not sure how to implement this logic with pyspark.sql.functions.udf; the broadcast-based direction I am imagining is sketched below.
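Again untested, and it assumes spark is the active SparkSession, topic is an integer, and df1 comfortably fits in driver memory (it should, at fewer than 100 rows):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Collect the small df1 to the driver and broadcast it as a plain
# Python dict {topic: {term: weight}} so the UDF can read it on
# every executor without joining.
topic_weights = {
    row["topic"]: dict(zip(row["terms"], row["termWeights"]))
    for row in df1.collect()
}
bc_weights = spark.sparkContext.broadcast(topic_weights)

@F.udf(IntegerType())
def best_topic(tokens):
    token_set = set(tokens)
    best, best_score = None, float("-inf")
    # Score each topic by summing the weights of its terms that
    # appear in this row's tokens; keep the highest-scoring topic.
    for topic, weights in bc_weights.value.items():
        score = sum(w for t, w in weights.items() if t in token_set)
        if score > best_score:
            best, best_score = topic, score
    return best

result = df2.withColumn("topic", best_topic("tokens"))
```

Is one of these two directions the right way to do this, or is there a better pattern for this kind of small-to-large matching?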