2 votes

From the official Apache Spark documentation:

http://spark.apache.org/docs/latest/rdd-programming-guide.html

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

Going by these definitions, is there a big difference between them? And is there really a difference at all?


3 Answers

5 votes

They serve different purposes. If we look at the (simplified) method definition for map:

def map[U](func: (T) ⇒ U): Dataset[U]

map expects that, given an element of type T, you yield an element of type U, for every element, resulting in a Dataset[U]. In other words, it is a means of transforming an element of type T into an element of type U.

On the other hand, filter:

def filter(func: (T) ⇒ Boolean): Dataset[T]

filter expects that, given an element of type T, you provide a Boolean value saying whether that element should be kept in the resulting Dataset[T] (such a function is often referred to as a predicate).

A concrete example of map can be:

import spark.implicits._ // assuming a SparkSession named `spark` is in scope; supplies the Encoder[Int] that map needs
val someDataSet: Dataset[String] = ???
val transformedDataSet: Dataset[Int] = someDataSet.map(str => str.toInt)

And for filter:

val someDataSet: Dataset[String] = ???
val transformedDataSet: Dataset[String] = someDataSet.filter(str => str.length > 5)
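
Putting the two together in a runnable form, here is a minimal sketch; the SparkSession setup, the object name, and the sample data are my own illustrative assumptions, not from the answer above:

import org.apache.spark.sql.{Dataset, SparkSession}

object MapVsFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("map-vs-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // supplies the Encoders that map and toDS need

    val numbers: Dataset[String] = Seq("1", "22", "333333").toDS()

    // map: each String becomes an Int, so the element type changes
    val asInts: Dataset[Int] = numbers.map(_.toInt)

    // filter: each String is tested, and the element type stays String
    val longOnes: Dataset[String] = numbers.filter(_.length > 5)

    asInts.show()   // 1, 22, 333333
    longOnes.show() // 333333
    spark.stop()
  }
}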
4 votes

From the end user's perspective, it's really just a difference in how you use the API. map takes a record as input and returns a record with some function applied to it, whereas filter takes a record as input and returns a Boolean that decides whether the record is kept. Internally, Spark executes both with mapPartitions.
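
A rough sketch of that equivalence, written against the RDD API; this is my paraphrase for illustration, not Spark's actual source code:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// map(f) behaves like mapPartitions mapping f over each partition's iterator
def mapViaPartitions[T, U: ClassTag](rdd: RDD[T])(f: T => U): RDD[U] =
  rdd.mapPartitions(iter => iter.map(f))

// filter(p) behaves like mapPartitions dropping elements that fail the predicate
def filterViaPartitions[T: ClassTag](rdd: RDD[T])(p: T => Boolean): RDD[T] =
  rdd.mapPartitions(iter => iter.filter(p))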

0 votes

Think of the whole RDD as a pipeline into which you throw apples, bananas, and peaches. In the pipeline there is a filter that only lets apples go through [filter], so you are left with just apples. Then you transform each apple into a candy apple by applying some sugar [map].

So you start with apples, bananas, and peaches, and you end up producing candy apples.
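
In code, the analogy might look like this; the fruit data and the assumption of a SparkSession named `spark` with its implicits in scope are mine, just to illustrate the pipeline:

import org.apache.spark.sql.Dataset
import spark.implicits._ // assuming a SparkSession named `spark` is in scope

val fruits: Dataset[String] = Seq("apple", "banana", "peach", "apple").toDS()

val candyApples: Dataset[String] =
  fruits
    .filter(_ == "apple")          // only apples make it through the pipe
    .map(fruit => s"candy $fruit") // coat each surviving apple in sugar

candyApples.show() // candy apple, candy apple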