how fast is Google App Engine MapReduce?

Question

How much of a compute-intensive gain can one expect on GAE MapReduce? The scenario of interest to me is compute intensive, so for example: multiplying a trillion random floats in a single threaded single core application. Then imagine 1000 MapReduce workers multiplying a billion random numbers each and announcing "finished" when all workers have finished. Assume billing is enabled if that matters. (It might not).

Edit: A commenter asked for clarification. The title has been revised. If the task takes 50000 seconds single threaded and in an alternative implementation 1000 MapReduce workers are employed and they finish after 500 seconds, then the performance gain is 100 times. 1000 workers: 100 times gain, only slightly disappointing, but so be it for this example. How can I get finished sooner? Can I ask for 10,000 workers? This question may have to do with limits and quotas. Assume an adequate budget. Does MapReduce's compute-intensive performance gain head to an asymptote and if so what is the performance gain at that asymptote? There was also information in the comment about MapReduce being suitable for large amounts of data generated by a user facing URL however, my question is not in regard to a Datastore-intensive application's performance versus the same application rewritten for MapReduce. Datastore activity will be minimal in this compute-intensive scenario. I realize there will always be some Datastore activity in any MapReduce application, but since this is a compute-intensive scenario, the Datastore activity and the size of the Datastore entities is not going to be a big influence on the performance gain calculated. The task will use the Datastore for less than 1% of the elapsed time. Nor is the scenario involving a large amount of communication bandwidth (other than the minimum necessary to hit the task queued URLs that MapReduce uses). The question is in regard to comparing a compute-intensive single threaded non-MapReduce task's elapsed time to the same task's elapsed time on MapReduce which is inherently multi-threaded given there are multiple workers. I use the word "task" generically, in other words, "task means work". The gain might (but not necessarily) be a function of the number of workers hence I mentioned 1000 workers in the example.

Nick Johnson Nick Johnson · Accepted Answer · 2011-03-31T01:56:56

It's not clear exactly what you're asking here. Are you asking how efficient it is? How cheap it is? How fast it is?

In general, App Engine is designed for serving user-facing sites, and the App Engine mapreduce API exists to assist with that - processing large amounts of data generated by the user-facing site. If you have a large amount of data that's hosted outside App Engine, and you want to do some sort of large-scale data processing on it, App Engine is probably not the tool for you.

Regarding performance, you can expect each worker to execute tasks as fast as they would be if you were executing them serially, so your items-per-second is roughly the number of workers multiplied by the regular rate - there's relatively little overhead. There can be some delay at the end, though, when different workers finish at different times, and how much this is depends on how good a job mapreduce does of sharding your data. With datastore input, this used to be fairly poor, but it's a lot better now.

As to how many mappers you can have, that depends on a number of things: Whether or not your app has billing enabled, how much other traffic your app gets, and how long your mapper tasks take per element. The only real way to determine this is to experiment a bit.

how fast is Google App Engine MapReduce?

1 Answers