I have a DataFrame with 500 million rows. I would like to go through each row, rename/drop a few columns, and update column values based on a few conditions. I am using the approach below with collect:
df.collect().foreach(row => myCustomMethod(row))
Since collect brings all the data to the driver, I am facing out-of-memory errors. Can you please suggest alternative ways of achieving the same result?
We are using the DataStax spark-cassandra-connector. I have tried different approaches, but nothing has helped improve the performance.
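For context, here is a minimal sketch of the kind of driver-side-free rewrite I have been experimenting with: expressing the rename/drop/conditional-update as DataFrame transformations (withColumnRenamed, drop, withColumn with when/otherwise), which run on the executors instead of collecting to the driver. The column names (old_name, obsolete_col, amount) and the threshold are made-up placeholders for my real schema:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when, lit}

object TransformWithoutCollect {
  // Pure DataFrame-to-DataFrame transformation: evaluated lazily and
  // executed on the cluster, so nothing is pulled back to the driver.
  def transform(df: DataFrame): DataFrame =
    df.withColumnRenamed("old_name", "new_name") // rename a column
      .drop("obsolete_col")                      // dropping an absent column is a no-op
      .withColumn("amount_flag",                 // conditional value update
        when(col("amount") > 100, lit("high")).otherwise(lit("low")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transform-without-collect")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory sample standing in for the 500M-row Cassandra table.
    val df = Seq((1, "a", 10), (2, "b", 200)).toDF("id", "old_name", "amount")
    transform(df).show()
    spark.stop()
  }
}
```

Is a formulation along these lines the right direction, or is there a better pattern (e.g. mapPartitions) for per-row custom logic at this scale?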