1
votes

I have two dataframe in pyspark: df1

+-------+--------+----------------+-------------+                               
|new_lat|new_long|        lat_long|   State_name|
+-------+--------+----------------+-------------+
|  33.64| -117.63|[33.64,-117.625] |STATE 1     |
|  23.45| -101.54|[23.45,-101.542] |STATE 2     |
+-------+--------+----------------+-------------+

df2

+---------+-----+--------------------+----------+------------+
|    label|value|            dateTime|       lat|        long|
+---------+-----+--------------------+----------+------------+
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5423864|
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5420964|

I want to join these two tables based on matching lat,long value upto 2 decimal. So the output dataframe I want is:

df3

+---------+-----+--------------------+----------+------------+------+
|    label|value|            dateTime|       lat|        long|state |
+---------+-----+--------------------+----------+------------+-------
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1
|msg      |  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5423864|STATE 2
|msg      |  437|2019-04-06T05:10:...| 23.453622|-101.5420964|STATE 2

How can I do this in an efficient way considering df2 has more than 100M rows.

I tried with df3=df1.join(df2, df1. new_lat == df2. lat, 'left') but not sure how can I consider upto two decimal in df1

2
What have you tried so far?ScootCork
@ScootCork I tried with this one but not sure how to consider upto 2 decimal in join statement df3 = df1.join(df2, df1. new_lat == df2. lat, 'left')Schan
I guess the most direct approach is to round the df2.lat column to two decimal places and then join on that column.ScootCork

2 Answers

1
votes

Use substring in your join condition.

df3=df1.join(df2, df1.new_lat == substring(df2.lat,1,5), 'left')
0
votes

substring is definitely the easiest implementation, but won't always give you the accuracy that you might require (think int rounding on 0.5).

To get better accuracy you can just use a quick filter:

threshold = 0.01

df3 = (
    df1
    .join(df2)
    .filter(df1.new_lat - threshold < df2.lat)
    .filter(df2.lat < df1.new_lat + threshold)
)