Apache Spark lookup function

Question

The definition of the lookup method at https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.PairRDDFunctions reads:

def lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.

How can I ensure that the RDD has a known partitioner? I understand that an RDD is partitioned across the nodes in a cluster, but what is meant by the statement "only searching the partition that the key maps to"?

Justin Pihony · Accepted Answer · 2015-05-07T14:27:53

A number of operations (especially on key-value pairs) automatically set up a partitioner when they are executed, because a known partitioner can increase efficiency by cutting down on network traffic. For example (from PairRDDFunctions):

def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
  }

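Note the creation of a HashPartitioner. You can check the partitioner field of your RDD (an Option[Partitioner]) to see whether it has one. You can also set one via partitionBy. Once the partitioner is known, lookup can compute which partition a key belongs to and scan only that partition instead of all of them, which is what "only searching the partition that the key maps to" means.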
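
As a rough sketch of the above, the following shows checking the partitioner, setting one with partitionBy, and then calling lookup. The object name, app name, local master setting, sample data, and partition count of 4 are all assumptions made for illustration; they do not come from the question.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object LookupExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("lookup-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data.
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

      // A freshly parallelized RDD has no partitioner, so a lookup on it
      // would have to filter every partition.
      println(s"before partitionBy: ${pairs.partitioner}")

      // partitionBy shuffles the data once and records the partitioner on
      // the resulting RDD. Caching avoids repeating the shuffle per lookup.
      val partitioner = new HashPartitioner(4)
      val partitioned = pairs.partitionBy(partitioner).cache()
      println(s"after partitionBy: ${partitioned.partitioner}")

      // The partitioner maps each key to exactly one partition index.
      println(s"key 'a' maps to partition ${partitioner.getPartition("a")}")

      // With a known partitioner, lookup only needs to run on that partition.
      val values: Seq[Int] = partitioned.lookup("a")
      println(s"values for 'a': $values")
    } finally {
      sc.stop()
    }
  }
}
```

If you call lookup repeatedly on the same RDD, paying for partitionBy once up front is usually what makes the per-key lookups cheap; for a single lookup the shuffle may cost more than it saves.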
