2 votes

From the official Apache Spark documentation:

http://spark.apache.org/docs/latest/rdd-programming-guide.html

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

Going by these definitions, is there a big difference between them? And is there really a difference at all?


3 Answers

5 votes

They serve different purposes. If we look at the (simplified) method definition for map:

def map[U](func: (T) ⇒ U): Dataset[U]

map expects that, given an element of type T, you yield an element of type U, for every element, resulting in a Dataset[U]. In other words, it is a means of transforming an element of type T into an element of type U.

On the other hand, filter:

def filter(func: (T) ⇒ Boolean): Dataset[T]

filter expects that, given an element of type T, you provide a Boolean value saying whether that element should be kept in the resulting Dataset[T] (such a function is often referred to as a predicate).

A concrete example of map can be:

import spark.implicits._ // assuming a SparkSession named `spark` is in scope; supplies the Encoder[Int] that map needs
val someDataSet: Dataset[String] = ???
val transformedDataSet: Dataset[Int] = someDataSet.map(str => str.toInt)

And for filter:

val someDataSet: Dataset[String] = ???
val transformedDataSet: Dataset[String] = someDataSet.filter(str => str.length > 5)
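
Putting the two together in a runnable form, here is a minimal sketch; the SparkSession setup, the object name, and the sample data are my own illustrative assumptions, not from the answer above:

import org.apache.spark.sql.{Dataset, SparkSession}

object MapVsFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("map-vs-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // supplies the Encoders that map and toDS need

    val numbers: Dataset[String] = Seq("1", "22", "333333").toDS()

    // map: each String becomes an Int, so the element type changes
    val asInts: Dataset[Int] = numbers.map(_.toInt)

    // filter: each String is tested, and the element type stays String
    val longOnes: Dataset[String] = numbers.filter(_.length > 5)

    asInts.show()   // 1, 22, 333333
    longOnes.show() // 333333
    spark.stop()
  }
}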
4 votes

From the end user's perspective, it's really just a difference in how you use the API. map takes a record as input and returns a record with some function applied to it, whereas filter takes a record as input and returns a Boolean that decides whether the record is kept. Internally, Spark executes both with mapPartitions.
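
A rough sketch of that equivalence, written against the RDD API; this is my paraphrase for illustration, not Spark's actual source code:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// map(f) behaves like mapPartitions mapping f over each partition's iterator
def mapViaPartitions[T, U: ClassTag](rdd: RDD[T])(f: T => U): RDD[U] =
  rdd.mapPartitions(iter => iter.map(f))

// filter(p) behaves like mapPartitions dropping elements that fail the predicate
def filterViaPartitions[T: ClassTag](rdd: RDD[T])(p: T => Boolean): RDD[T] =
  rdd.mapPartitions(iter => iter.filter(p))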

0 votes

Think of the whole RDD as a pipeline into which you throw apples, bananas, and peaches. In the pipeline there is a filter that only lets apples go through [filter], so you are left with just apples. Then you transform each apple into a candy apple by applying some sugar [map].

So you start with apples, bananas, and peaches, and you end up producing candy apples.
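
In code, the analogy might look like this; the fruit data and the assumption of a SparkSession named `spark` with its implicits in scope are mine, just to illustrate the pipeline:

import org.apache.spark.sql.Dataset
import spark.implicits._ // assuming a SparkSession named `spark` is in scope

val fruits: Dataset[String] = Seq("apple", "banana", "peach", "apple").toDS()

val candyApples: Dataset[String] =
  fruits
    .filter(_ == "apple")          // only apples make it through the pipe
    .map(fruit => s"candy $fruit") // coat each surviving apple in sugar

candyApples.show() // candy apple, candy apple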