join two pyspark dataframe based on matching value(upto certain decimal point)

Question

I have two dataframe in pyspark: df1

+-------+--------+----------------+-------------+                               
|new_lat|new_long|        lat_long|   State_name|
+-------+--------+----------------+-------------+
|  33.64| -117.63|[33.64,-117.625] |STATE 1     |
|  23.45| -101.54|[23.45,-101.542] |STATE 2     |
+-------+--------+----------------+-------------+

df2

+---------+-----+--------------------+----------+------------+
|    label|value|            dateTime|       lat|        long|
+---------+-----+--------------------+----------+------------+
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5423864|
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5420964|

I want to join these two tables based on matching lat,long value upto 2 decimal. So the output dataframe I want is:

df3

+---------+-----+--------------------+----------+------------+------+
|    label|value|            dateTime|       lat|        long|state |
+---------+-----+--------------------+----------+------------+-------
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5423864|STATE 2
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5420964|STATE 2

How can I do this in an efficient way considering df2 has more than 100M rows.

I tried with df3=df1.join(df2, df1. new_lat == df2. lat, 'left') but not sure how can I consider upto two decimal in df1

@ScootCork I tried with this one but not sure how to consider upto 2 decimal in join statement df3 = df1.join(df2, df1. new_lat == df2. lat, 'left') — Schan
I guess the most direct approach is to round the df2.lat column to two decimal places and then join on that column. — ScootCork

notNull notNull · Accepted Answer · 2020-07-06T15:02:44

Use substring in your join condition.

df3=df1.join(df2, df1.new_lat == substring(df2.lat,1,5), 'left')

join two pyspark dataframe based on matching value(upto certain decimal point)

2 Answers