I have two dataframes in PySpark: df1
+-------+--------+----------------+----------+
|new_lat|new_long|        lat_long|State_name|
+-------+--------+----------------+----------+
|  33.64| -117.63|[33.64,-117.625]|   STATE 1|
|  23.45| -101.54|[23.45,-101.542]|   STATE 2|
+-------+--------+----------------+----------+
df2
+---------+-----+--------------------+----------+------------+
|    label|value|            dateTime|       lat|        long|
+---------+-----+--------------------+----------+------------+
|      msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|      msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|      msg|  437|2019-04-06T05:10:...| 23.453622|-101.5423864|
|      msg|  437|2019-04-06T05:10:...| 23.453622|-101.5420964|
+---------+-----+--------------------+----------+------------+
I want to join these two tables by matching the lat/long values up to 2 decimal places. The output dataframe I want is:
df3
+---------+-----+--------------------+----------+------------+-------+
|    label|value|            dateTime|       lat|        long|  state|
+---------+-----+--------------------+----------+------------+-------+
|      msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1|
|      msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1|
|      msg|  437|2019-04-06T05:10:...| 23.453622|-101.5423864|STATE 2|
|      msg|  437|2019-04-06T05:10:...| 23.453622|-101.5420964|STATE 2|
+---------+-----+--------------------+----------+------------+-------+
How can I do this efficiently, considering that df2 has more than 100M rows?
I tried

df3 = df1.join(df2, df1.new_lat == df2.lat, 'left')

but I am not sure how to restrict the match to two decimal places against df1 (and this only compares latitude).
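For concreteness, something like this is what I have in mind, though I am not sure it is the right way to do the two-decimal comparison (rounding df2 inside the join condition is just a guess on my part):

from pyspark.sql import functions as F

# guess: round df2's coordinates to 2 decimals inside the join condition and
# compare them with df1's already-rounded new_lat / new_long
cond = (F.round(df2.lat, 2) == df1.new_lat) & (F.round(df2.long, 2) == df1.new_long)
df3 = df2.join(df1, cond, 'left')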
Round the df2.lat column to two decimal places and then join on that column. – ScootCork
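A sketch based on that comment, rounding both coordinates and broadcasting the small df1 so the 100M-row df2 is not shuffled (the 'state' alias is my own naming, and I have not verified this at that scale):

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# round df2's lat/long to 2 decimals and join on both coordinates;
# broadcast(df1) keeps the big df2 from being shuffled
cond = (F.round(df2.lat, 2) == df1.new_lat) & (F.round(df2.long, 2) == df1.new_long)
df3 = (
    df2.join(broadcast(df1), cond, 'left')
       .select(df2['*'], df1.State_name.alias('state'))
)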