
How do I iterate over a Dataset in Spark 2.0 and Scala? My problem is that I need to compare two rows: I need to compare DateN and DateN-1 and calculate the difference.

 Row1 - Date1 Num1
 Row2 - Date2 Num2
 ...
 RowN - DateN NumN
Does your df contain only two rows? If not, what exactly do you want to compute given the data? Please elaborate on the problem, as there are plenty of methods available. – elcomendante
No, that's just an example. My DS has many rows. As I mentioned above, I need to compare two dates from two rows in an iteration in Scala and find their difference. – coder AJ
You want "window functions". See, for example, databricks.com/blog/2015/07/15/… – The Archetypal Paul
Thank you... will take a look. – coder AJ
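
A minimal sketch of the window-function approach suggested in that comment (the date and num column names are assumptions, not from the question; note that a Window without partitionBy pulls all rows into a single partition, which Spark warns about):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, datediff, lag}

    val spark = SparkSession.builder.appName("WindowExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input matching the layout in the question: one (date, num) per row
    val df = Seq(
      ("2016-05-01", 50.00),
      ("2016-05-03", 45.00),
      ("2016-05-04", 55.00)
    ).toDF("date", "num")

    // lag("date", 1) pulls DateN-1 onto DateN's row; ordering by date defines N
    val w = Window.orderBy("date")
    df.withColumn("prev_date", lag(col("date"), 1).over(w))
      .withColumn("diff_days", datediff(col("date"), col("prev_date")))
      .show()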

1 Answer


Not sure whether you resolved the issue using a window function, since you only want to compare rows n and n-1 and I don't see an attribute on which you would partition the data. For the requirement you describe, you can solve it as follows:

  1. Add an index to the RDD using zipWithIndex.
  2. Create an RDD of the odd-indexed rows.
  3. Create an RDD of the even-indexed rows.
  4. Now you can apply your logic on the two RDDs (a sketch of this step follows the example below).

Following is a working example:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .appName("Example")
      .master("local[*]")
      .getOrCreate()

    val customers = spark.sparkContext.parallelize(List(
      ("Alice", "2016-05-01", 50.00),
      ("Alice", "2016-05-03", 45.00),
      ("Alice", "2016-05-04", 55.00),
      ("Bob", "2016-05-01", 25.00),
      ("Bob", "2016-05-04", 29.00),
      ("Bob", "2016-05-06", 27.00)))

    // Step 1: attach a zero-based index to every row
    val custIndexed = customers.zipWithIndex()

    // Steps 2 and 3: split into odd- and even-indexed RDDs
    // (keep them as RDDs rather than collect()-ing, so step 4 can stay distributed)
    val custOdd = custIndexed.filter(record => record._2 % 2 != 0)
    val custEven = custIndexed.filter(record => record._2 % 2 == 0)
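
For step 4, a minimal sketch of one way to combine the two RDDs, assuming the dates are ISO-formatted strings and Java 8's java.time is available: key each RDD by index / 2 so that row 2k lines up with row 2k+1, join them, and take the day difference.

    import java.time.LocalDate
    import java.time.temporal.ChronoUnit

    // Key each RDD by its pair number so that row 2k joins with row 2k+1
    val evenByPair = custEven.map { case (row, idx) => (idx / 2, row) }
    val oddByPair = custOdd.map { case (row, idx) => (idx / 2, row) }

    // Join the pairs and compute the day difference between the two dates
    val dayDiffs = evenByPair.join(oddByPair).map {
      case (_, ((name1, date1, _), (name2, date2, _))) =>
        val days = ChronoUnit.DAYS.between(LocalDate.parse(date1), LocalDate.parse(date2))
        (name1, date1, date2, days)
    }

    dayDiffs.collect().foreach(println)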