3
votes

I'm going through some existing Spark code as part of my learning process, and I came across the following:

enPages.map(lambda x: x.split(" "))\
        .map(lambda x: (x[2], int(x[3])))\
        .reduceByKey(lambda x, y: x + y, 40)\
        .collect()

I have a basic understanding of how reduceByKey and lambda work, but in the above code I don't understand the purpose of the '40' in the reduceByKey call. Any clarification would be appreciated.

Thanks, Sasi.

PS: I tried running the code without the '40' and compared the outputs, only to find that the order of the key/value pairs changed, but for a given key the summed value in the final output is the same.
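
For reference, a minimal standalone version of that comparison looks roughly like this (the SparkContext setup and the sample lines are made-up stand-ins for the enPages data, not from the original code):

from pyspark import SparkContext

sc = SparkContext("local[4]", "reduceByKey-demo")

# Hypothetical lines shaped like the enPages data: the key is in
# field 2 and an integer count is in field 3.
lines = sc.parallelize([
    "en 20090505 Main_Page 10",
    "en 20090505 Spark 3",
    "en 20090506 Main_Page 7",
])

pairs = (lines.map(lambda x: x.split(" "))
              .map(lambda x: (x[2], int(x[3]))))

with_40 = pairs.reduceByKey(lambda x, y: x + y, 40).collect()
default = pairs.reduceByKey(lambda x, y: x + y).collect()

# Same per-key totals either way; only the order from collect() may differ.
print(sorted(with_40) == sorted(default))  # True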


1 Answer

4
votes

This is the optional numPartitions argument of reduceByKey, which lets you repartition as part of the reduce itself, as described in the Apache Spark documentation.

The function:

.reduceByKey(lambda x, y: x + y, 40)

In this piece of code, the lambda, as you already know, sums the values for each key: x is the running total (the accumulator) and y is the next value. The 40 is the number of partitions the resulting RDD will have.
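
You can verify this by checking the partition count of the result before collecting; a sketch, with enPages standing in for the RDD from the question:

pairs = (enPages.map(lambda x: x.split(" "))
                .map(lambda x: (x[2], int(x[3]))))

summed = pairs.reduceByKey(lambda x, y: x + y, 40)
print(summed.getNumPartitions())  # 40

# Without the argument, Spark falls back to the default partitioning
# (the parent RDD's partition count or spark.default.parallelism).
print(pairs.reduceByKey(lambda x, y: x + y).getNumPartitions())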

This is really useful when you want to change the number of partitions without paying for an extra shuffle: reduceByKey has to shuffle the data anyway, so setting the target partition count there avoids a separate repartition() step afterwards.
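
For example, both pipelines below end with the data summed into 40 partitions, but the second one performs a second shuffle; again a sketch reusing the hypothetical pairs RDD from the previous snippet:

# One shuffle: reduceByKey writes its output straight into 40 partitions.
one_shuffle = pairs.reduceByKey(lambda x, y: x + y, 40)

# Two shuffles: reduce with the default partitioning, then repartition.
two_shuffles = pairs.reduceByKey(lambda x, y: x + y).repartition(40)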