0
votes

i'm wondering how works this little snippet:

if i have this text:

Ut quis pretium tellus. Fusce quis suscipit ipsum. Morbi viverra elit ut malesuada pellentesque. Fusce eu ex quis urna lobortis finibus. Integer aliquam faucibus neque id cursus. Nulla non massa odio. Fusce pretium felis felis, at malesuada felis blandit nec. Praesent ligula enim, gravida sit amet scelerisque eget, porta non mi. Aenean vitae maximus tortor, ac facilisis orci.

and this snippet code that count the occurences of each words on the text above:

        // Load  input data.
        JavaRDD<String> input = sc.textFile(inputFile);
        // Split up into words.
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Arrays.asList(x.split(" "));
            }
        });
        // Transform into word and count.
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2(x, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });

It's simple to understand that this line

JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String x) {
                    return Arrays.asList(x.split(" "));
                }
            });

creates a dataset containing the whole words splitted by space

and this line gives at each tuple the value of one, so for example:

JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String x) {
                    return new Tuple2(x, 1);

Ut,1
quis,1 //go on

i'm confused on how reduceByKey works, and how it can count the occurences of each words?

thanks in advance.

2
Please, show how you got Ut,1\nquis,1\n...? Are you printing it or write into file or something else?sheh
i suppose that works in that way, I haven't print anything.OiRc

2 Answers

2
votes

reduceByKey groups tuples by the key (first argument in each tuple) and makes reduce for each of group.

Like this:

(Ut, 1), (quis, 1), ..., (quis, 1), ..., (quis, 1), ... mapToPair

               \            /             |                           reduceByKey
                      +
                 (quis, 1+1)              |
                       \                 /
                         \             /  
                                +
                            (quis, 2+1)
1
votes

reduceByKey is quite similar to reduce. They both take a function and use it to combine values.

reduceByKey runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.

Because datasets can have very large numbers of keys, reduceByKey is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.

Reference: Learning Spark - Lightning-Fast Big Data Analysis - Chapter 4 - Working with Key/Value Pairs.