I am trying construct distinction matrix using spark and am confused how to do it optimally. I am new to spark. I have given a small example of what I'm trying to do below.
Example of distinction matrix construction:
Given Dataset D:
+----+-----+------+-----+
| id | a1 | a2 | a3 |
+----+-----+------+-----+
| 1 | yes | high | on |
| 2 | no | high | off |
| 3 | yes | low | off |
+----+-----+------+-----+
and my distinction table is
+-------+----+----+----+
| id,id | a1 | a2 | a3 |
+-------+----+----+----+
| 1,2 | 1 | 0 | 1 |
| 1,3 | 0 | 1 | 1 |
| 2,3 | 1 | 1 | 0 |
+-------+----+----+----+
i.e whenever an attribute ai is helpful in distinguishing a pair of tuples, distinction table has a 1, otherwise a 0.
My Datasets are huge and I trying to do it in spark.Following are approaches that came to my mind:
- using nested for loop to iterate over all members of RDD (of dataset)
- using cartesian() transformation over original RDD and iterate over all members of resultant RDD to get distinction table.
My questions are:
In 1st approach, does spark automatically optimize nested for loop setup internally for parallel processing?
In 2nd approach, using cartesian() causes extra storage overhead to store intermediate RDD. Is there any way to avoid this storage overhead and get final distinction table?
Which of these approaches is better and is there any other approach which can be useful to construct distinction matrix efficiently (both space and time)?
map,flatMap&filter, which all of them are distributed on Spark. - For your second question, in general yes a cartesian is expensive. However, if you use aDatasetor aDataFrameinstead of anRDDand after the join you perform somefilterit will be optimized. - Luis Miguel Mejía Suárez