0 votes

I have a very large CSV file, so I used Spark to load it into a Spark DataFrame.
I need to extract the latitude and longitude from each row of the CSV in order to create a Folium map.
With pandas I can solve my problem with a loop:

for index, row in locations.iterrows():
    folium.CircleMarker(location=(row["Pickup_latitude"],
                                  row["Pickup_longitude"]),
                        radius=20,
                        color="#0A8A9F",
                        fill=True).add_to(marker_cluster)

I found that, unlike a pandas DataFrame, a Spark DataFrame can't be processed with a loop (see how to loop through each row of dataFrame in pyspark).

So I thought I could work around the problem by splitting the big data into Hive tables and then iterating over them.

Is it possible to split the huge Spark DataFrame into Hive tables and then iterate over the rows with a loop?


1 Answer

1 vote

Generally you don't need to iterate over a DataFrame or an RDD. You only create transformations (like map) that will be applied to each record, and then call an action to trigger that processing.

You need something like:

(dataframe
    .withColumn("latitude", <how to extract latitude>)
    .withColumn("longitude", <how to extract longitude>)
    .select("latitude", "longitude")
    .rdd
    .map(lambda row: <extract values from Row type>)
    .collect())         # this will move data to a local collection
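
For example, a minimal sketch along those lines, assuming the DataFrame is called df and the columns are named Pickup_latitude and Pickup_longitude as in your pandas code (the map centre is just a placeholder):

import folium
from folium.plugins import MarkerCluster
from pyspark.sql import functions as F

# df is the Spark DataFrame loaded from the CSV (assumed name)
coords = (df
          .select(F.col("Pickup_latitude").cast("double").alias("latitude"),
                  F.col("Pickup_longitude").cast("double").alias("longitude"))
          .dropna()
          .collect())   # moves the selected columns to the driver

# build the map locally from the collected rows
m = folium.Map(location=[40.73, -73.94], zoom_start=11)   # example centre
marker_cluster = MarkerCluster().add_to(m)

for row in coords:
    folium.CircleMarker(location=(row["latitude"], row["longitude"]),
                        radius=20,
                        color="#0A8A9F",
                        fill=True).add_to(marker_cluster)

Keep in mind that collect() pulls every selected row back to the driver, so if the DataFrame is truly huge you may want to sample() or limit() it first before building the map.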

In case you can't do it with SQL, you can do it with the RDD API:

(dataframe
    .rdd
    .map(lambda row: <create new row with latitude and longitude>)
    .collect())
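
A rough sketch of that RDD variant, again assuming the df and marker_cluster from the example above:

# extract (latitude, longitude) tuples directly from each Row
coords = (df
          .rdd
          .map(lambda row: (float(row["Pickup_latitude"]),
                            float(row["Pickup_longitude"])))
          .collect())

for lat, lon in coords:
    folium.CircleMarker(location=(lat, lon),
                        radius=20,
                        color="#0A8A9F",
                        fill=True).add_to(marker_cluster)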