17
votes

I'm running this scala code on a 32-bit quad-core Core2 system:

def job(i: Int, s: Int): Long = {
  val r = (i to 500000000 by s).map(_.toLong).foldLeft(0L)(_ + _)
  println("Job " + i + " done")
  r
}

import scala.actors.Future
import scala.actors.Futures._

val JOBS=4

val jobs = (0 until JOBS).toList.map(i => future { job(i, JOBS) })
println("Running...")
val results = jobs.map(f => f())
println(results.foldLeft(0L)(_ + _))

(Yes, I do know there are much more efficient ways to sum a series of integers; it's just to give the CPU something to do).

Depending on what I set JOBS to, the code runs in the following times:

JOBS=1 : 31.99user 0.84system 0:28.87elapsed 113%CPU
JOBS=2 : 27.71user 1.12system 0:14.74elapsed 195%CPU
JOBS=3 : 33.19user 0.39system 0:13.02elapsed 257%CPU
JOBS=4 : 49.08user 8.46system 0:22.71elapsed 253%CPU

I'm surprised that this doesn't really scale well beyond 2 futures "in play". I do a lot of multithreaded C++ code and have no doubt I'd get good scaling up to 4 cores and see >390% CPU utilisation if I coded this sort of thing with Intel's TBB or boost::threads (it'd be considerably more verbose of course).

So: what's going on, and how can I get the scaling to 4 cores I'd expect to see? Is this limited by something in Scala or the JVM? It occurs to me I don't actually know "where" Scala's futures run... is a thread spawned per future, or does `Futures` provide a dedicated thread pool for running them?

[I'm using the Scala 2.7.7 packages from Debian/Squeeze on a Lenny system with sun-java6 (6-20-0lenny1).]
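For comparison, here's roughly what the same fan-out looks like with an explicitly sized pool via plain `java.util.concurrent`, which at least makes the worker count explicit (a sketch only; the pool size and the smaller iteration bound are arbitrary):

```scala
import java.util.concurrent.{Callable, Executors}

// Sketch: the same fan-out with an explicitly sized thread pool, so the
// worker count is under our control rather than the futures library's.
val JOBS = 4
val pool = Executors.newFixedThreadPool(JOBS)

// Allocation-free inner loop (smaller bound than the benchmark, for brevity).
def job(i: Long, s: Long): Long = {
  var t = 0L
  var v = i
  while (v <= 100000000L) { t += v; v += s }
  t
}

// Submit one Callable per job and sum the results.
val futures = (0 until JOBS).map { i =>
  pool.submit(new Callable[Long] { def call() = job(i, JOBS) })
}
val total = futures.map(_.get()).foldLeft(0L)(_ + _)
pool.shutdown()
println(total)
```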

Update:

As suggested in Rex's answer, I recoded to avoid object creation.

def job(i: Long, s: Long): Long = {
  var t = 0L
  var v = i
  while (v <= 10000000000L) {
    t += v
    v += s
  }
  println("Job " + i + " done")
  t
}
// Rest as above...

This was so much faster I had to significantly increase the iteration count to run for any amount of time! Results are:

JOBS=1: 28.39user 0.06system 0:29.25elapsed 97%CPU
JOBS=2: 28.46user 0.04system 0:14.95elapsed 190%CPU
JOBS=3: 24.66user 0.06system 0:10.26elapsed 240%CPU
JOBS=4: 28.32user 0.12system 0:07.85elapsed 362%CPU

which is much more like what I'd hope to see (although the 3 jobs case is a little odd, with one task consistently completing a couple of seconds before the other two).

Pushing it a bit further: on a quad-core hyperthreaded i7, the latter version with JOBS=8 achieves a 4.4x speedup over JOBS=1, with 571% CPU usage.

2
You're just impatient, wanting the future today! Seriously, Rex hit the nail on the head: you're benchmarking garbage collection, not futures efficiency. – Edwin Buck

Heh... too true. When I submitted this question I'd not been using Scala that long, and was probably a bit too credulous of some of the extreme hype surrounding it. – timday

Care to re-run the test with akka.dispatch.Future? – Viktor Klang

@Viktor: Well, this was submitted before I had access to 2.8. Now that 2.8 has turned up, parallel collections have pretty much obsoleted futures for parallelising compute (futures still have good uses relating to managing asynchrony, of course). Last time I looked at Scala parallel performance I got some (good) results, shown in the "performance" box of this poster: timday.bitbucket.org/project_euler-poster.pdf. And as Rex notes below, the real issue is whether object churn kills the garbage collector. Any particular reason Akka would fare better there? – timday

I've heard reports that Akka 2.0 Futures can beat ParCol under certain circumstances. Akka Futures also use zero locks and are very tiny. – Viktor Klang

2 Answers

15
votes

My guess is that the garbage collector is doing more work than the addition itself. So you're limited by what the garbage collector can manage. Try running the test again with something that doesn't create any objects (e.g. use a while loop instead of the range/map/fold). You can also play with the parallel GC options if your real application will hit the GC this heavily.
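The "parallel GC options" mentioned above might look like this (both are standard HotSpot flags, passed through the scala launcher with -J; `SumJobs.scala` is a placeholder name for the benchmark script):

```shell
# Try the throughput-oriented parallel collector and pin its thread count.
scala -J-XX:+UseParallelGC -J-XX:ParallelGCThreads=4 SumJobs.scala
```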

2
votes

Try

(i to 500000000 by s).view.map(_.toLong).foldLeft(0L)(_ + _)

The application of `view` is supposed (as I understood it) to avoid repeated iteration and object creation by providing simple wrappers.

Note also that you can use `reduceLeft(_ + _)` instead of `foldLeft(0L)(_ + _)`.
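A quick sanity check that the strict and view-based pipelines agree, on a deliberately small range (`n` here is arbitrary):

```scala
// Both compute the same sum; the view avoids materialising an
// intermediate strict collection for the map step.
val n = 1000
val strict  = (1 to n).map(_.toLong).foldLeft(0L)(_ + _)
val viaView = (1 to n).view.map(_.toLong).reduceLeft(_ + _)
println(strict + " " + viaView)
```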