I have a DataFrame with 500 million rows. I would like to go through each row, rename/drop a few columns, and update column values based on a few conditions. I am using the approach below with collect:
df.collect().foreach(row => myCustomMethod(row))
Since collect brings all the data to the driver, I am facing out-of-memory errors. Can you please suggest alternative ways of achieving the same result?
We are using the DataStax spark-cassandra-connector. I have tried different approaches, but nothing has helped improve the performance.
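For context, here is a minimal sketch of the kind of driver-side-free rewrite I have been experimenting with: expressing the rename/drop/conditional-update as DataFrame transformations (withColumnRenamed, drop, withColumn with when/otherwise), which run on the executors instead of collecting to the driver. The column names (old_name, obsolete_col, amount) and the threshold are made-up placeholders for my real schema:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when, lit}

object TransformWithoutCollect {
  // Pure DataFrame-to-DataFrame transformation: evaluated lazily and
  // executed on the cluster, so nothing is pulled back to the driver.
  def transform(df: DataFrame): DataFrame =
    df.withColumnRenamed("old_name", "new_name") // rename a column
      .drop("obsolete_col")                      // dropping an absent column is a no-op
      .withColumn("amount_flag",                 // conditional value update
        when(col("amount") > 100, lit("high")).otherwise(lit("low")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transform-without-collect")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory sample standing in for the 500M-row Cassandra table.
    val df = Seq((1, "a", 10), (2, "b", 200)).toDF("id", "old_name", "amount")
    transform(df).show()
    spark.stop()
  }
}
```

Is a formulation along these lines the right direction, or is there a better pattern (e.g. mapPartitions) for per-row custom logic at this scale?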