I'm wondering how this little snippet works.
Suppose I have this text:
Ut quis pretium tellus. Fusce quis suscipit ipsum. Morbi viverra elit ut malesuada pellentesque. Fusce eu ex quis urna lobortis finibus. Integer aliquam faucibus neque id cursus. Nulla non massa odio. Fusce pretium felis felis, at malesuada felis blandit nec. Praesent ligula enim, gravida sit amet scelerisque eget, porta non mi. Aenean vitae maximus tortor, ac facilisis orci.
and this code snippet, which counts the occurrences of each word in the text above:
// Load input data.
JavaRDD<String> input = sc.textFile(inputFile);
// Split up into words.
JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String x) {
        return Arrays.asList(x.split(" "));
    }
});
// Transform into word and count.
JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String x) {
        return new Tuple2<String, Integer>(x, 1);
    }
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer x, Integer y) {
        return x + y;
    }
});
It's easy to see that this statement
JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String x) {
        return Arrays.asList(x.split(" "));
    }
});
creates a dataset containing all the words of the text, split on spaces,
and that this statement turns each word into a tuple with the value 1, for example:
JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String x) {
        return new Tuple2<String, Integer>(x, 1);
    }
});
Ut,1
quis,1 // and so on
I'm confused about how reduceByKey works: how does it count the occurrences of each word?
Thanks in advance.
Ut,1\nquis,1\n... ? Are you printing it, writing it to a file, or something else? – sheh
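One way to picture what reduceByKey does with the (word, 1) tuples is the plain-Java sketch below. It does not use Spark at all: the class ReduceByKeySketch and its methods are made up for illustration, and the single-machine loop is only an assumption-laden model of the semantics. Real Spark applies the same function within each partition first and then merges the partial results across partitions, which is why the function should be associative and commutative. The idea is that all tuples sharing a key are folded together with the supplied function, so for the key "Fusce" the values 1, 1, 1 become (1 + 1) + 1 = 3.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical, Spark-free model of reduceByKey semantics (illustration only).
public class ReduceByKeySketch {

    // Fold the values of all pairs that share a key using the function f.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> f) {
        Map<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // First time a key is seen: store its value.
            // Otherwise: replace the stored value with f(stored, new).
            result.merge(pair.getKey(), pair.getValue(), f);
        }
        return result;
    }

    // Mirrors the question's pipeline: split on spaces, map to (word, 1), reduce by key.
    static Map<String, Integer> wordCount(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.split(" ")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return reduceByKey(pairs, (x, y) -> x + y);
    }

    public static void main(String[] args) {
        // prints {Fusce=3, quis=2, ipsum=1}
        System.out.println(wordCount("Fusce quis Fusce ipsum quis Fusce"));
    }
}
```

The lambda (x, y) -> x + y plays the role of the Function2 in the question: x is the running total for a word and y is the next 1 to be added to it.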