I'm running this scala code on a 32-bit quad-core Core2 system:
def job(i:Int,s:Int):Long = {
val r=(i to 500000000 by s).map(_.toLong).foldLeft(0L)(_+_)
println("Job "+i+" done")
r
}
import scala.actors.Future
import scala.actors.Futures._
val JOBS=4
val jobs=(0 until JOBS).toList.map(i=>future {job(i,JOBS)})
println("Running...")
val results=jobs.map(f=>f())
println(results.foldLeft(0L)(_+_))
(Yes, I do know there are much more efficient ways to sum a series of integers; it's just to give the CPU something to do).
Depending on what I set JOBS to, the code runs in the following times:
JOBS=1 : 31.99user 0.84system 0:28.87elapsed 113%CPU
JOBS=2 : 27.71user 1.12system 0:14.74elapsed 195%CPU
JOBS=3 : 33.19user 0.39system 0:13.02elapsed 257%CPU
JOBS=4 : 49.08user 8.46system 0:22.71elapsed 253%CPU
I'm surprised that this doesn't really scale well beyond 2 futures "in play". I do a lot of multithreaded C++ code and have no doubt I'd get good scaling up to 4 cores and see >390% CPU utilisation if I coded this sort of thing with Intel's TBB or boost::threads
(it'd be considerably more verbose of course).
So: what's going on and how can I get the scaling to 4 cores I'd expect to see ? Is this limited by something in scala or the JVM ? It occurs to me I don't actually know "where" scala's futures run... is a thread spawned per future, or does "Futures" provide a thread pool dedicated to running them ?
[I'm using the scala 2.7.7 packages from Debian/Squeeze on a Lenny system with sun-java6 (6-20-0lennny1).]
Update:
As suggested in Rex's answer, I recoded to avoid object creation.
def job(i:Long,s:Long):Long = {
var t=0L
var v=i
while (v<=10000000000L) {
t+=v
v+=s
}
println("Job "+i+" done")
t
}
// Rest as above...
This was so much faster I had to significantly increase the iteration count to run for any amount of time! Results are:
JOBS=1: 28.39user 0.06system 0:29.25elapsed 97%CPU
JOBS=2: 28.46user 0.04system 0:14.95elapsed 190%CPU
JOBS=3: 24.66user 0.06system 0:10.26elapsed 240%CPU
JOBS=4: 28.32user 0.12system 0:07.85elapsed 362%CPU
which is much more like what I'd hope to see (although the 3 jobs case is a little odd, with one task consistently completing a couple of seconds before the other two).
Pushing it a bit further, on a quad-core hyperthreaded i7 the latter version with JOBS=8
achieves an x4.4 speedup vs JOBS=1, with 571% CPU usage.