I have two RDDs, each with two columns as (K, V) pairs. In the source files, a key appears on several consecutive rows, and each row assigns a different, distinct value to that key. The text files used to create the RDDs are given at the bottom of this post.
The keys of the two RDDs are completely different, and I would like to join the two RDDs on their values and find how many common values exist for each key pair. e.g. I am trying to reach a result such as (1-5, 10), meaning that key "1" from RDD1 and key "5" from RDD2 share 10 values in common.
I work on a single machine with 256 GB of RAM and 72 cores. One text file is 500 MB and the other is 3 MB.
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("app").setMaster("local[*]")
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.4")
  .set("spark.executor.memory", "128g")
  .set("spark.driver.maxResultSize", "0")
val sc = new SparkContext(conf)

// RDD1 keeps (key, value); RDD2 is flipped to (value, key) so it can be grouped by value.
val RDD1 = sc.textFile("\\t1.txt", 1000).map { line => val s = line.split("\t"); (s(0), s(1)) }
val RDD2 = sc.textFile("\\t2.txt", 1000).map { line => val s = line.split("\t"); (s(1), s(0)) }

// Broadcast a value -> keys lookup built from the small RDD.
val RDD2BC = sc.broadcast(RDD2.groupByKey.collectAsMap)

// For each (key1, value) in RDD1, emit ("key1-key2", 1) for every RDD2 key2 sharing that value.
val joined = RDD1.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- RDD2BC.value.getOrElse(v1, Iterable())
} yield (s"$k-$v2", 1))

joined.foreach(println)

// Count how many values each key pair has in common.
val result = joined.reduceByKey((a, b) => a + b)
As the script shows, I try to manage this issue with a broadcast variable. If I join RDD2 (250K rows) with itself, the pairs show up in the same partitions, so little shuffling takes place and the results come back in about 3 minutes. However, when joining RDD1 with RDD2, the pairs are scattered across partitions, which triggers a very expensive shuffle, and the job always ends with:
ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 168591 ms
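For reference, the timeouts involved here are configurable. A minimal sketch, assuming the standard Spark settings spark.executor.heartbeatInterval and spark.network.timeout (the values are illustrative, not tuned; the interval must stay below the timeout):

// Sketch: raise the heartbeat/network timeouts referenced by the error above.
val patientConf = new SparkConf().setAppName("app").setMaster("local[*]")
  .set("spark.executor.heartbeatInterval", "60s") // default 10s
  .set("spark.network.timeout", "600s")           // default 120s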
Based on my results:
Should I try to split the text file behind RDD1 into smaller chunks and join those chunks with RDD2 separately? (A sketch of this idea follows the code below.)
Is there another way of joining two RDDs based on their value fields? If I use the original values as keys and join the RDDs with the join function, the pairs are again scattered across partitions, which again makes the reduceByKey step very expensive, e.g. (a cogroup variant is also sketched below):
val RDD1 = sc.textFile("\\t1.txt", 1000).map { line => val s = line.split("\t"); (s(1), s(0)) }
val RDD2 = sc.textFile("\\t2.txt", 1000).map { line => val s = line.split("\t"); (s(1), s(0)) }
RDD1.join(RDD2).map(line => (line._2, 1)).reduceByKey((a, b) => a + b)
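For the first question, a minimal sketch of the chunked idea. It reuses RDD1 as (key, value) and the broadcast map RDD2BC from the main script above (not the value-keyed redefinition just shown); randomSplit and union are standard RDD methods, and the chunk count of 10 is arbitrary:

// Split RDD1 into 10 roughly equal chunks; join and count each chunk separately.
val chunks = RDD1.randomSplit(Array.fill(10)(0.1))
val partials = chunks.map { chunk =>
  chunk.mapPartitions(iter => for {
    (k, v1) <- iter
    v2 <- RDD2BC.value.getOrElse(v1, Iterable())
  } yield (s"$k-$v2", 1)).reduceByKey(_ + _)
}
// The same key pair can appear in several chunks, so reduce once more after the union.
val total = partials.reduce(_ union _).reduceByKey(_ + _)

For the second question, a variant of the same value-keyed join using cogroup instead of join, reusing the value-keyed RDD1 and RDD2 just defined (a sketch; cogroup collects both key lists per value once, and the pairs are emitted from that grouping):

// For each value, pair every RDD1 key with every RDD2 key that shares it.
val counts = RDD1.cogroup(RDD2)
  .flatMap { case (_, (ks1, ks2)) =>
    for (k1 <- ks1.iterator; k2 <- ks2) yield (s"$k1-$k2", 1)
  }
  .reduceByKey(_ + _)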
PSEUDO DATA SAMPLE:
KEY VALUE
1 13894
1 17376
1 15688
1 22434
1 2282
1 14970
1 11549
1 26027
1 2895
1 15052
1 20815
2 9782
2 3393
2 11783
2 22737
2 12102
2 10947
2 24343
2 28620
2 2486
2 249
2 3271
2 30963
2 30532
2 2895
2 13894
2 874
2 2021
3 6720
3 3402
3 25894
3 1290
3 21395
3 21137
3 18739
...
A SMALL EXAMPLE
RDD1
2 1
2 2
2 3
2 4
2 5
2 6
3 1
3 6
3 7
3 8
3 9
4 3
4 4
4 5
4 6
RDD2
21 1
21 2
21 5
21 11
21 12
21 10
22 7
22 8
22 13
22 9
22 11
JOIN RESULTS BASED ON THIS DATA:
(3-22,1)
(2-21,1)
(3-22,1)
(2-21,1)
(3-22,1)
(4-21,1)
(2-21,1)
(3-21,1)
REDUCEBYKEY RESULTS:
(4-21,1)
(3-21,1)
(2-21,3)
(3-22,3)
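For completeness, a self-contained sketch that reproduces the small example above with the broadcast approach (local mode, data inlined with parallelize; the object name and structure are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ValueJoinDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))

    // The small example above, as (key, value) pairs.
    val rdd1 = sc.parallelize(Seq(
      (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
      (3, 1), (3, 6), (3, 7), (3, 8), (3, 9),
      (4, 3), (4, 4), (4, 5), (4, 6)))
    val rdd2 = sc.parallelize(Seq(
      (21, 1), (21, 2), (21, 5), (21, 11), (21, 12), (21, 10),
      (22, 7), (22, 8), (22, 13), (22, 9), (22, 11)))

    // Broadcast a value -> keys map for the small side.
    val bc = sc.broadcast(rdd2.map(_.swap).groupByKey.collectAsMap)

    val result = rdd1.mapPartitions(iter => for {
      (k1, v) <- iter
      k2 <- bc.value.getOrElse(v, Iterable())
    } yield (s"$k1-$k2", 1)).reduceByKey(_ + _)

    result.collect.foreach(println)
    // Prints, in some order: (2-21,3), (3-21,1), (3-22,3), (4-21,1)
    sc.stop()
  }
}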