I currently have a program that benefits greatly from multithreading. It starts n threads, and each thread does 100M iterations. All threads use shared memory, but there is no synchronization at all. The program approximates solutions to some equations, and the current benchmarks are:
1 thread: precision 1 time: 150s
4 threads: precision 4 time: 150s
16 threads: precision 16 time: 150s
32 threads: precision 32 time: 210s
64 threads: precision 64 time: 420s
(Higher precision is better)
I use an Amazon EC2 'Cluster Compute Eight Extra Large Instance', which has 2 x Intel Xeon E5-2670. As far as I understand, it has 16 real cores, so the program improves linearly up to 16 threads. It also has 2x 'hyper-threading', and my program gains somewhat from this. Increasing the number of threads beyond 32 obviously gives no improvement.
These benchmarks suggest that access to RAM is not the bottleneck.
I also ran the program on an Intel Xeon E5645, which has 12 real cores. The results are:
1 thread: precision 1 time: 150s
4 threads: precision 4 time: 150s
12 threads: precision 12 time: 150s
24 threads: precision 24 time: 220s
The value of precision/(time*thread#) is similar to that of the Amazon machine, which is not clear to me, because each core in the Xeon E5-2670 is ~1.5 times faster according to cpu MHz (~1600 vs ~2600) and the http://www.cpubenchmark.net/cpu_list.php 'Passmark CPU Mark' number adjusted for
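To make the comparison concrete, here is a small sketch that recomputes the metric from the two benchmark tables. It reports the aggregate throughput (precision/time) and the per-thread figure precision/(time*thread#) for every row. The function name `summarize` and the variable names are only illustrative. The numbers are copied from the tables and nothing else is measured.

```python
# Benchmark figures copied from the two tables: (threads, precision, seconds).
ec2_e5_2670 = [(1, 1, 150), (4, 4, 150), (16, 16, 150), (32, 32, 210), (64, 64, 420)]
xeon_e5645 = [(1, 1, 150), (4, 4, 150), (12, 12, 150), (24, 24, 220)]


def summarize(rows):
    """Return (threads, throughput, per_thread) for each benchmark row.

    throughput = precision / time
    per_thread = precision / (time * threads), the metric used in the question
    """
    return [(n, p / t, p / (t * n)) for n, p, t in rows]


if __name__ == "__main__":
    for name, rows in (("2 x E5-2670", ec2_e5_2670), ("E5645", xeon_e5645)):
        print(name)
        for n, throughput, per_thread in summarize(rows):
            print(f"  {n:>2} threads: {throughput:.4f} precision/s, "
                  f"{per_thread:.5f} precision/(s*thread)")
```

With these figures the per-thread value is 1/150 on both machines for every row up to the physical core count. Going from 16 to 32 threads on the E5-2670 machine raises throughput by about 1.43x, going from 12 to 24 threads on the E5645 raises it by about 1.36x, and 32 and 64 threads give identical throughput (32/210 = 64/420).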
- Why does using a faster processor not improve single-threaded performance, while increasing the number of threads does?
- Is it possible to rent a server with multiple CPUs more powerful than 2 x Intel Xeon E5-2670 that still share RAM, so that I can run my program without any changes and get better results?
Update:
13 threads on the Xeon E5645 take 196 seconds.
The algorithm randomly explores a tree that has 3500 nodes. The height of the tree is 7. Each node contains 250 doubles, which are also accessed randomly. It is very likely that almost no data is cached.
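As a rough check of the caching assumption, the size of the data can be estimated from the numbers above. This is only a sketch. It assumes each node holds nothing but its 250 doubles at 8 bytes each (child pointers, padding and allocator overhead are ignored). `fits_in_cache` is a hypothetical helper, and the cache size passed to it has to be looked up for the actual CPU.

```python
NODES = 3500             # from the description above
DOUBLES_PER_NODE = 250   # from the description above
BYTES_PER_DOUBLE = 8     # IEEE 754 double precision


def working_set_bytes(nodes=NODES, doubles_per_node=DOUBLES_PER_NODE):
    """Payload size only; pointers and padding are ignored (assumption)."""
    return nodes * doubles_per_node * BYTES_PER_DOUBLE


def fits_in_cache(cache_bytes):
    """True if the estimated payload is no larger than the given cache size."""
    return working_set_bytes() <= cache_bytes


if __name__ == "__main__":
    size = working_set_bytes()
    print(f"{size} bytes = {size / 2**20:.2f} MiB")
```

The estimate comes to 7,000,000 bytes, or about 6.68 MiB, which can then be compared with the last-level cache size of each processor.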