3
votes

I have a bunch of commands to execute for gene sequencing. For example:

msclle_program -in 1.txt
msclle_program -in 2.txt
msclle_program -in 3.txt
      .........
msclle_program -in 10.txt

These commands are independent of each other.

The environment is a Linux desktop with an Intel i7 (4 cores / 8 threads) and 12 GB of memory.

I can split these commands into different n.sh scripts and run them simultaneously.
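For example, instead of separate .sh files, a single script could launch them all in the background and wait (just a sketch, assuming the inputs really are named 1.txt through 10.txt):

for i in $(seq 1 10); do
    msclle_program -in "$i.txt" &   # start each run in the background
done
wait                                # block until all runs have finished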

My question is: how can I fully utilize multiple CPUs, multiple cores, and hyper-threading to make the programs run faster?

More specifically, how many script files should I split the commands into?

My own understanding is:

  1. Split into 7 script files, so each CPU will run one program at 100%.
  2. Within one CPU, the CPU will utilize its multiple cores and hyper-threading on its own.

Is this true?

Many thanks for your comments.

4
This isn't explained properly, so I'm just leaving it as a comment: you should run 8 instances of your program to fully utilize your CPU, because you have 8 "cores" (this is assuming one instance can saturate a core, i.e. your program is CPU-bound). – thirtydot
It will depend on various factors; the best thing to do is run tests with varying numbers of simultaneous processes and plot them on a graph. You should see which number yields the best performance on your hardware. – Jeremy Friesner
@thirtydot: I have 7 real CPUs. So I could split into 7*8 files and each CPU would run 8 programs? – teloon
The thing is, your computer is not just a set of CPUs operating in a vacuum. There is also RAM, and RAM caches, and the OS, and the OS's context-switching overhead, and the hard disk(s), and the network, and so on. Contention for any of those resources can affect performance in ways that are not easily predictable in advance. That is why there is no substitute for actually trying various levels of parallelism and measuring their performance. – Jeremy Friesner

4 Answers

6
votes

The answer is not simple or straightforward, and splitting the task into one program per CPU is likely to be non-optimal; indeed, it may be poor or even extremely poor.

First, as I understand it, you have seven quad-core CPUs (presumably there are eight, but you're saving one for the OS?). If you run a single-threaded process on each CPU, you will be using a single thread on a single core. The other three cores and all of the hyper-threads will not be used.

The hardware and OS cannot split a single thread over multiple cores.

You could, however, run four single-threaded processes per CPU (one per core), or even eight (one per hyper-thread). Whether or not this is optimal depends on the work being done by the processes: in particular, their working-set size and memory access patterns, and on the hardware cache arrangement; the number of cache levels, their sizes, and how they are shared. The NUMA arrangement of the cores also needs to be considered.

Basically, an extra thread has to give you quite a bit of speed-up to outweigh what it can cost you in cache utilization, main-memory accesses, and the disruption of prefetching.

Furthermore, because the effect of the working set exceeding certain cache limits is profound, what seems fine for, say, one or two cores may be appalling for four or eight, so you can't experiment with one core and assume the results carry over to eight.

Having a quick look, I see the i7 has a small L2 cache and a huge L3 cache. Given your data set, I wouldn't be surprised if there's a lot of data being processed. The question is whether or not it is processed sequentially (i.e. whether prefetching will be effective). If the data is not processed sequentially, you may do better by reducing the number of concurrent processes so that their working sets tend to fit inside the L3 cache. I suspect that if you run eight or sixteen processes, the L3 cache will be hammered (overflowed). On the other hand, if your data access is non-sequential, the L3 cache probably isn't going to save you anyway.
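As the comments also point out, there is no substitute for measuring. A minimal sketch of such an experiment, assuming GNU xargs and the 1.txt-10.txt file names from the question (-P caps the number of concurrent processes):

for n in 1 2 4 8; do
    echo "== $n concurrent processes =="
    time ( seq 1 10 | xargs -P "$n" -I{} msclle_program -in {}.txt )
done

Plot the totals against n and pick the fastest point; for cache-bound workloads the curve often flattens, or even turns upward, well before n reaches the number of hardware threads.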

1
vote

You can spawn multiple processes and then assign each process to one CPU. You can use taskset -c to do this.

Keep a rolling counter and increment it to specify the processor number for each process.
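A minimal sketch of what that could look like, reusing the program and file names from the question (the modulo keeps the rolling counter within the 8 logical CPUs, numbered 0-7):

cpu=0
for i in $(seq 1 10); do
    taskset -c "$cpu" msclle_program -in "$i.txt" &   # pin this run to one logical CPU
    cpu=$(( (cpu + 1) % 8 ))                          # rolling counter, wraps at 8
done
wait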

1
vote

Split into 7 script files, so each CPU will run one program at 100%.

This is approximately correct: if you have 7 single-threaded programs and 7 processing units, then each unit has one thread to run. This is optimal: with fewer programs, some processing units would sit idle; with more, time would be wasted alternating between them. However, if you really had 7 quad-core processors, the optimal number of threads (from a "CPU-bound" perspective) would be 28. This is simplified, as in reality there will be other programs around sharing the CPU.
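As an aside, before settling on a number it is worth checking what the machine actually has; the asker's i7 is one physical CPU with 4 cores and 8 hardware threads, not 7 CPUs. On Linux:

nproc    # number of logical CPUs the OS sees (8 here)
lscpu    # sockets, cores per socket, threads per core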

Within one CPU, the CPU will utilize its multiple cores and hyper-threading on its own.

No. Whether or not all the cores are in a single CPU makes little difference (it does make some difference for caching, though). In any case, the processor won't do any multithreading on its own; that's the programmer's job. This is why making programs faster has become very challenging: until about 2005 or so it was a free ride, as clock frequencies were steadily rising, but now that limit has been reached, and speeding up programs requires splitting the work across a growing number of processing units. It's one of the major reasons for the renaissance of functional programming.

0
votes

Why run them as separate processes? Consider running multiple threads in one process instead, which would make the memory footprint much smaller and reduce the amount of process scheduling required.

You could look at it this way (a bit over-simplified but still):

Consider dividing your work into processable units (PUs). You then want two or more cores to each process one PU at a time, such that they don't interfere with each other; the more cores you have, the more PUs you can process at once.

The work involved in processing one PU is input + processing + output (I+P+O). Since the program is probably processing units from large in-memory structures containing perhaps millions of elements or more, the input and output mostly involve memory. With one core this is not a problem, because no other core interferes with its memory accesses. With multiple cores, the contention basically moves to the nearest common resource, in this case the L3 cache, giving cache input (CI) and cache output (CO). With two cores you want CI+CO to equal P/2 or less, because then the two cores can take turns accessing the nearest common resource (the L3 cache) without interfering with each other. With three cores CI+CO would need to be P/3, and with four or eight cores you would need CI+CO to equal P/4 or P/8.

So the trick is to make the processing required for a PU reside completely inside a core and its own caches (L1 and L2). The more cores you have, the larger the PUs should be (in relation to the I/O required), so that each PU stays isolated inside its core as long as possible, with all the data it needs available in its local caches.

To sum up, you want the cores to do as much meaningful and efficient processing as possible while touching the L3 cache as little as possible, because the L3 cache is the bottleneck. Achieving such a balance is a challenge, but by no means impossible.

As you might expect, cores executing "traditional" multi-threaded administrative or web applications (where no care whatsoever is taken to economize on L3 accesses) will constantly collide with each other over access to the L3 cache and resources further out. It is not uncommon for multi-threaded programs running on multiple cores to be slower than if they had run on a single core.

Also, don't forget that OS work impacts the cache (a lot) as well. If you divide the problem into separate processes (as I mentioned above), you'll be calling in the OS to referee much more often than is absolutely necessary.

In my experience, the existence of this problem, and its dos and don'ts, are mostly unknown or poorly understood.