Apache Spark map-reduce explanation

Question

I'm wondering how this little snippet works.

If I have this text:

Ut quis pretium tellus. Fusce quis suscipit ipsum. Morbi viverra elit ut malesuada pellentesque. Fusce eu ex quis urna lobortis finibus. Integer aliquam faucibus neque id cursus. Nulla non massa odio. Fusce pretium felis felis, at malesuada felis blandit nec. Praesent ligula enim, gravida sit amet scelerisque eget, porta non mi. Aenean vitae maximus tortor, ac facilisis orci.

and this snippet of code that counts the occurrences of each word in the text above:

        // Load input data.
        JavaRDD<String> input = sc.textFile(inputFile);
        // Split up into words.
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Arrays.asList(x.split(" "));
            }
        });
        // Transform into word and count.
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2<String, Integer>(x, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });

It's simple to understand that this line

        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Arrays.asList(x.split(" "));
            }
        });

creates a dataset containing all the words, split on spaces,

and that this line maps each word to a tuple with the value 1, producing for example:

        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2<String, Integer>(x, 1);
            }
        });

Ut,1
quis,1 // and so on

I'm confused about how reduceByKey works. How does it count the occurrences of each word?

Thanks in advance.

Please show how you got Ut,1\nquis,1\n... Are you printing it, writing it to a file, or something else? — sheh

sheh · Accepted Answer · 2015-06-04T13:42:20

reduceByKey groups the tuples by key (the first element of each tuple) and applies the reduce function to the values within each group, two at a time, until a single value per key remains.

Like this:

(Ut, 1), (quis, 1), ..., (quis, 1), ..., (quis, 1), ... mapToPair

               \            /             |                           reduceByKey
                      +
                 (quis, 1+1)              |
                       \                 /
                         \             /  
                                +
                            (quis, 2+1)
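
The diagram above can be mimicked in plain Java without Spark. This is only a minimal sketch of the idea, not how Spark implements it: the class `ReduceByKeySketch`, its `reduceByKey` helper, and the shortened sample sentence are all made up for illustration. It pairs each word with 1 and, whenever a key is seen again, combines the stored value with the new one using the same `x + y` function as in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

public class ReduceByKeySketch {

    // Hypothetical helper (not part of Spark): folds the values of equal keys
    // together with the given function, which is what reduceByKey does per key.
    public static Map<String, Integer> reduceByKey(String[] words, BinaryOperator<Integer> reducer) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            // mapToPair step: each word is treated as the pair (word, 1).
            // reduce step: if the key already exists, merge() calls
            // reducer(oldValue, 1); otherwise it simply stores 1.
            counts.merge(word, 1, reducer);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in which "quis" occurs three times,
        // matching the three (quis, 1) pairs in the diagram.
        String[] words = "Ut quis pretium quis suscipit quis".split(" ");
        Map<String, Integer> counts = reduceByKey(words, (x, y) -> x + y);
        System.out.println(counts); // prints {Ut=1, quis=3, pretium=1, suscipit=1}
    }
}
```

For "quis" the function is called twice: first as 1 + 1, then as 2 + 1, exactly as in the diagram. Spark performs these merges in parallel across partitions and in no guaranteed order, which is why the function passed to reduceByKey should be associative and commutative.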
