0
votes

I have a data frame where I have done haversine distance calculation based on the latitude and longitude. I want to find the min.distance, id and store_code.

Dataframe looks like -

+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no|  dist_km|
+---+---------+---------+-----+-----+--------+---------+
|  1|13.031885|80.235574|29.91|73.88|      22| 1988.047|
|  1|13.031885|80.235574|28.57|77.33|      23| 1754.225|
|  1|13.031885|80.235574|26.86|80.95|      24|1539.8511|
|  2|19.099819|72.915288|29.91|73.88|      22|1206.3154|
|  3| 22.22698| 84.83607|29.91|73.88|      22|1387.3323|
|  2|19.099819|72.915288|28.57|77.33|      23|1144.7731|
|  2|19.099819|72.915288|26.86|80.95|      24|1191.7048|
|  3| 22.22698| 84.83607|28.57|77.33|      23|1032.1859|
|  3| 22.22698| 84.83607|26.86|80.95|      24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+

I want my final df should be - 

+---+---------+---------+-----+-----+--------+---------+
| id| user_lat| user_lon|s_lat|s_lon|store_no|  dist_km|
+---+---------+---------+-----+-----+--------+---------+
|  1|13.031885|80.235574|26.86|80.95|      24|1539.8511|
|  2|19.099819|72.915288|28.57|77.33|      23|1144.7731|
|  3| 22.22698| 84.83607|26.86|80.95|      24| 648.0673|
+---+---------+---------+-----+-----+--------+---------+
1

1 Answers

0
votes

This should work for you -

Create nearest_store_df with minimun distance.

import pyspark.sql.functions as psf

nearest_store_df = df\
    .groupBy('id')\
    .agg(psf.min('dist_km').alias('min_dist_km'))

Now join the nearest_store_df with original dataframe.

df\
    .join(nearest_store_df, df.dist_km == nearest_store_df.min_dist_km,'inner')\
    .show()