3
votes

I have a bunch of commands to execute for gene sequencing. For example:

msclle_program -in 1.txt
msclle_program -in 2.txt
msclle_program -in 3.txt
      .........
msclle_program -in 10.txt

These commands are independent of each other.

The environment is a Linux desktop with an Intel i7 (4 cores / 8 threads) and 12 GB of memory.

I can split these commands into different n.sh scripts and run them simultaneously.
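For example, instead of separate .sh files, a single script could launch them all in the background and wait (just a sketch, assuming the inputs really are named 1.txt through 10.txt):

for i in $(seq 1 10); do
    msclle_program -in "$i.txt" &   # start each run in the background
done
wait                                # block until all runs have finished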

My question is: how can I fully utilize multiple CPUs, multiple cores, and hyper-threading to make the programs run faster?

More specifically, how many script files should I split the commands into?

My own understanding is:

  1. Split into 7 script files, so each CPU will run one program at 100%.
  2. Within one CPU, the CPU will utilize its multiple cores and hyper-threading on its own.

Is this true?

Many thanks for your comments.

4
This isn't explained properly, so I'm just leaving it as a comment: you should run 8 instances of your program to fully utilize your CPU, because you have 8 "cores" (this is assuming one instance can saturate a core, i.e. your program is CPU-bound). – thirtydot
It will depend on various factors; the best thing to do is run tests with varying numbers of simultaneous processes and plot them on a graph. You should see which number yields the best performance on your hardware. – Jeremy Friesner
@thirtydot: I have 7 real CPUs. So I could split into 7*8 files and each CPU would run 8 programs? – teloon
The thing is, your computer is not just a set of CPUs operating in a vacuum. There is also RAM, and RAM caches, and the OS, and the OS's context-switching overhead, and the hard disk(s), and the network, and so on. Contention for any of those resources can affect performance in ways that are not easily predictable in advance. That is why there is no substitute for actually trying various levels of parallelism and measuring their performance. – Jeremy Friesner

4 Answers

6
votes

The answer is not simple or straightforward, and splitting the task into one program per CPU is likely to be non-optimal; indeed, it may be poor or even extremely poor.

First, as I understand it, you have seven quad-core CPUs (presumably there are eight, but you're saving one for the OS?). If you run a single-threaded process on each CPU, you will be using a single thread on a single core. The other three cores and all of the hyper-threads will not be used.

The hardware and OS cannot split a single thread over multiple cores.

You could, however, run four single-threaded processes per CPU (one per core), or even eight (one per hyper-thread). Whether or not this is optimal depends on the work being done by the processes: in particular, their working-set size and memory access patterns, and on the hardware cache arrangement; the number of cache levels, their sizes, and how they are shared. The NUMA arrangement of the cores also needs to be considered.

Basically, an extra thread has to give you quite a bit of speed-up to outweigh what it can cost you in cache utilization, main-memory accesses, and the disruption of prefetching.

Furthermore, because the effect of the working set exceeding certain cache limits is profound, what seems fine for, say, one or two cores may be appalling for four or eight, so you can't experiment with one core and assume the results carry over to eight.

Having a quick look, I see the i7 has a small L2 cache and a huge L3 cache. Given your data set, I wouldn't be surprised if there's a lot of data being processed. The question is whether or not it is processed sequentially (i.e. whether prefetching will be effective). If the data is not processed sequentially, you may do better by reducing the number of concurrent processes so that their working sets tend to fit inside the L3 cache. I suspect that if you run eight or sixteen processes, the L3 cache will be hammered (overflowed). On the other hand, if your data access is non-sequential, the L3 cache probably isn't going to save you anyway.
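As the comments also point out, there is no substitute for measuring. A minimal sketch of such an experiment, assuming GNU xargs and the 1.txt-10.txt file names from the question (-P caps the number of concurrent processes):

for n in 1 2 4 8; do
    echo "== $n concurrent processes =="
    time ( seq 1 10 | xargs -P "$n" -I{} msclle_program -in {}.txt )
done

Plot the totals against n and pick the fastest point; for cache-bound workloads the curve often flattens, or even turns upward, well before n reaches the number of hardware threads.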

1
vote

You can spawn multiple processes and then assign each process to one CPU. You can use taskset -c to do this.

Keep a rolling counter and increment it to specify the processor number for each process.
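A minimal sketch of what that could look like, reusing the program and file names from the question (the modulo keeps the rolling counter within the 8 logical CPUs, numbered 0-7):

cpu=0
for i in $(seq 1 10); do
    taskset -c "$cpu" msclle_program -in "$i.txt" &   # pin this run to one logical CPU
    cpu=$(( (cpu + 1) % 8 ))                          # rolling counter, wraps at 8
done
wait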

1
vote

Split into 7 script files, so each CPU will run one program at 100%.

This is approximately correct: if you have 7 single-threaded programs and 7 processing units, then each unit has one thread to run. This is optimal: with fewer programs, some processing units would sit idle; with more, time would be wasted alternating between them. However, if you really had 7 quad-core processors, the optimal number of threads (from a "CPU-bound" perspective) would be 28. This is simplified, as in reality there will be other programs around sharing the CPU.
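As an aside, before settling on a number it is worth checking what the machine actually has; the asker's i7 is one physical CPU with 4 cores and 8 hardware threads, not 7 CPUs. On Linux:

nproc    # number of logical CPUs the OS sees (8 here)
lscpu    # sockets, cores per socket, threads per core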

Within one CPU, the CPU will utilize its multiple cores and hyper-threading on its own.

No. Whether or not all the cores are in a single CPU makes little difference (it does make some difference for caching, though). In any case, the processor won't do any multithreading on its own; that's the programmer's job. This is why making programs faster has become very challenging: until about 2005 or so it was a free ride, as clock frequencies were steadily rising, but now that limit has been reached, and speeding up programs requires splitting the work across a growing number of processing units. It's one of the major reasons for the renaissance of functional programming.

0
votes

Why run them as separate processes? Consider running multiple threads in one process instead, which would make the memory footprint much smaller and reduce the amount of process scheduling required.

You could look at it this way (a bit over-simplified but still):

Consider dividing your work into processable units (PUs). You then want two or more cores to each process one PU at a time, such that they don't interfere with each other; the more cores you have, the more PUs you can process at once.

The work involved in processing one PU is input + processing + output (I+P+O). Since the program is probably processing units from large in-memory structures containing perhaps millions of elements or more, the input and output mostly involve memory. With one core this is not a problem, because no other core interferes with its memory accesses. With multiple cores, the contention basically moves to the nearest common resource, in this case the L3 cache, giving cache input (CI) and cache output (CO). With two cores you want CI+CO to equal P/2 or less, because then the two cores can take turns accessing the nearest common resource (the L3 cache) without interfering with each other. With three cores CI+CO would need to be P/3, and with four or eight cores you would need CI+CO to equal P/4 or P/8.

So the trick is to make the processing required for a PU reside completely inside a core and its own caches (L1 and L2). The more cores you have, the larger the PUs should be (in relation to the I/O required), so that each PU stays isolated inside its core as long as possible, with all the data it needs available in its local caches.

To sum up, you want the cores to do as much meaningful and efficient processing as possible while touching the L3 cache as little as possible, because the L3 cache is the bottleneck. Achieving such a balance is a challenge, but by no means impossible.

As you might expect, cores executing "traditional" multi-threaded administrative or web applications (where no care whatsoever is taken to economize on L3 accesses) will constantly collide with each other over access to the L3 cache and resources further out. It is not uncommon for multi-threaded programs running on multiple cores to be slower than if they had run on a single core.

Also, don't forget that OS work impacts the cache (a lot) as well. If you divide the problem into separate processes (as I mentioned above), you'll be calling in the OS to referee much more often than is absolutely necessary.

In my experience, the existence of this problem, and its dos and don'ts, are mostly unknown or poorly understood.