64
votes

Let's say I had a program in C# that did something computationally expensive, like encoding a list of WAV files into MP3s. Ordinarily I would encode the files one at a time, but let's say I wanted the program to figure out how many CPU cores I had and spin up an encoding thread on each core. So, when I run the program on a quad core CPU, the program figures out it's a quad core CPU, figures out there are four cores to work with, then spawns four threads for the encoding, each of which is running on its own separate CPU. How would I do this?

And would this be any different if the cores were spread out across multiple physical CPUs? As in, if I had a machine with two quad core CPUs on it, are there any special considerations or are the eight cores across the two dies considered equal in Windows?


10 Answers

61
votes

Don't bother doing that.

Instead, use the Thread Pool. The thread pool is a mechanism (actually a class) of the framework that you can ask for a new thread.

When you ask for a new thread, it will either give you one or enqueue the work until a thread gets freed. That way the framework is in charge of deciding whether or not it should create more threads, depending on the number of CPUs present.
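
For example, a minimal sketch of pushing each encode onto the thread pool (PoolEncoder, EncodeToMp3 and the file list are placeholder names, not anything from the question):

using System;
using System.Threading;

class PoolEncoder
{
    static void Main()
    {
        string[] wavFiles = { "a.wav", "b.wav", "c.wav", "d.wav" };   // placeholder list

        using (var done = new CountdownEvent(wavFiles.Length))
        {
            foreach (string file in wavFiles)
            {
                // The pool decides how many of these actually run at once,
                // based on the machine's CPU count and current load.
                ThreadPool.QueueUserWorkItem(state =>
                {
                    EncodeToMp3((string)state);   // placeholder for the real encoder
                    done.Signal();
                }, file);
            }
            done.Wait();   // block until every queued encode has finished
        }
    }

    static void EncodeToMp3(string wavPath)
    {
        // CPU-bound encoding would go here.
    }
}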

Edit: In addition, as has already been mentioned, the OS is in charge of distributing the threads among the different CPUs.

17
votes

It is not necessarily as simple as using the thread pool.

By default, the thread pool allocates multiple threads per CPU. Since every thread that gets involved in the work you are doing has a cost (task-switching overhead, use of the CPU's very limited L1, L2 and maybe L3 caches, etc.), the optimal number of threads to use is <= the number of available CPUs, unless each thread is requesting services from other machines, as in a highly scalable web service. In some cases, particularly those that involve more hard-disk reading and writing than CPU activity, you can actually be better off with one thread than with multiple threads.

For most applications, and certainly for WAV and MP3 encoding, you should limit the number of worker threads to the number of available CPUs. Here is some C# code to find the number of CPUs:

// NUMBER_OF_PROCESSORS is an environment variable that Windows sets to the
// number of logical processors.
int processors = 1;
string processorsStr = System.Environment.GetEnvironmentVariable("NUMBER_OF_PROCESSORS");
int parsed;
if (processorsStr != null && int.TryParse(processorsStr, out parsed))
    processors = parsed;

Unfortunately, it is not as simple as limiting yourself to the number of CPUs. You also have to take into account the performance of the hard disk controller(s) and disk(s).

The only way you can really find the optimal number of threads is trial and error. This is particularly true when you are using hard disks, web services and such. With hard disks, you might be better off not using all four processors of your quad-core CPU. On the other hand, with some web services, you might be better off making 10 or even 100 requests per CPU.
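
For the question's encoding scenario, a hedged sketch of capping the workers at the CPU count is to have that many threads pull file names from a shared queue (BoundedEncoder, EncodeToMp3 and the file list are placeholders; the exact count should still be tuned by trial and error):

using System;
using System.Collections.Generic;
using System.Threading;

class BoundedEncoder
{
    static readonly Queue<string> files = new Queue<string>(
        new[] { "a.wav", "b.wav", "c.wav" });   // placeholder file list

    static void Main()
    {
        // Start from the CPU count, then tune up or down by trial and error.
        int workerCount = Environment.ProcessorCount;

        var workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(Work);
            workers[i].Start();
        }
        foreach (Thread t in workers)
            t.Join();   // wait for every worker to drain the queue
    }

    static void Work()
    {
        while (true)
        {
            string file;
            lock (files)
            {
                if (files.Count == 0)
                    return;          // nothing left to encode
                file = files.Dequeue();
            }
            EncodeToMp3(file);       // placeholder for the real encoder
        }
    }

    static void EncodeToMp3(string wavPath)
    {
        // CPU-bound encoding would go here.
    }
}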

12
votes

Although I agree with most of the answers here, I think it's worth adding a new consideration: SpeedStep technology.

When running a CPU-intensive, single-threaded job on a multi-core system (in my case a Xeon E5-2430 with 6 real cores, 12 with HT, under Windows Server 2012), the job got spread out among all 12 cores, using around 8.33% of each core and never triggering a speed increase. The CPU remained at 1.2 GHz.

When I set the thread affinity to a specific core, it used ~100% of that core, causing the CPU to max out at 2.5 GHz, more than doubling the performance.

This is the program I used, which just loops incrementing a variable. When called with -a, it will set the thread affinity to core 1. The affinity part was based on this post.

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;

namespace Esquenta
{
    class Program
    {
        private static int numThreads = 1;
        static bool affinity = false;
        static void Main(string[] args)
        {
            if (args.Contains("-a"))
            {
                affinity = true;
            }
            if (args.Length < 1 || !int.TryParse(args[0], out numThreads))
            {
                numThreads = 1;
            }
            Console.WriteLine("numThreads:" + numThreads);
            for (int j = 0; j < numThreads; j++)
            {
                var param = new ParameterizedThreadStart(EsquentaP);
                var thread = new Thread(param);
                thread.Start(j);
            }

        }

        static void EsquentaP(object numero_obj)
        {
            int i = 0;
            DateTime ultimo = DateTime.Now;
            if (affinity)
            {
                // Pin the managed thread to its current OS thread, then
                // restrict that OS thread to core 1 (mask 0x1).
                Thread.BeginThreadAffinity();
                CurrentThread.ProcessorAffinity = new IntPtr(1);
            }
            try
            {
                while (true)
                {
                    i++;
                    if (i == int.MaxValue)
                    {
                        i = 0;
                        // Millions of loop iterations per second since the last report.
                        var lps = int.MaxValue / (DateTime.Now - ultimo).TotalSeconds / 1000000;
                        Console.WriteLine("Thread " + numero_obj + " " + lps.ToString("0.000") + " M loops/s");
                        ultimo = DateTime.Now;
                    }
                }
            }
            finally
            {
                if (affinity)
                    Thread.EndThreadAffinity();
            }
        }

        [DllImport("kernel32.dll")]
        public static extern int GetCurrentThreadId();

        [DllImport("kernel32.dll")]
        public static extern int GetCurrentProcessorNumber();
        private static ProcessThread CurrentThread
        {
            get
            {
                int id = GetCurrentThreadId();
                return Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Single(x => x.Id == id);
            }
        }
    }
}

And the results:

[results screenshot]

Processor speed, as shown by Task Manager, similar to what CPU-Z reports:

[processor speed screenshot]

9
votes

In the case of managed threads, the complexity of doing this is a degree greater than that of native threads. This is because CLR threads are not directly tied to a native OS thread. In other words, the CLR can switch a managed thread from native thread to native thread as it sees fit. The function Thread.BeginThreadAffinity is provided to place a managed thread in lock-step with a native OS thread. At that point, you could experiment with using native APIs to give the underlying native thread processor affinity. As everyone suggests here, this isn't a very good idea. In fact, there is documentation suggesting that threads can receive less processing time if they are restricted to a single processor or core.

You can also explore the System.Diagnostics.Process class. There you can find a way to enumerate a process's threads as a collection of ProcessThread objects. This class lets you set ProcessorAffinity on a thread, or even set a preferred processor -- not sure what that is.
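
If you do want to experiment, here is a rough sketch of that approach, assuming you only care about pinning the current thread (AffinityExperiment and the core index are purely illustrative):

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;

static class AffinityExperiment
{
    [DllImport("kernel32.dll")]
    static extern int GetCurrentThreadId();

    // Pins the *current* managed thread to one zero-based core. Experiment only.
    public static void PinCurrentThreadToCore(int core)
    {
        Thread.BeginThreadAffinity();   // keep the managed thread on its current OS thread
        int osThreadId = GetCurrentThreadId();
        ProcessThread osThread = Process.GetCurrentProcess()
                                        .Threads.Cast<ProcessThread>()
                                        .Single(t => t.Id == osThreadId);
        osThread.ProcessorAffinity = (IntPtr)(1 << core);   // bit mask, one bit per core
        osThread.IdealProcessor = core;                      // the "preferred" processor hint
    }
}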

Disclaimer: I've experienced a similar problem where I thought the CPU(s) were under-utilized and researched a lot of this stuff; however, based on all that I read, it appeared that it wasn't a very good idea, as evidenced by the comments posted here as well. However, it's still interesting, and a learning experience, to experiment.

6
votes

You can definitely do this by writing the routine inside your program.

However, you should not try to do it, since the operating system is the best candidate to manage this stuff. I mean, a user-mode program should not try to do it.

However, sometimes it can be done (by a really advanced user) to achieve load balancing, or even to expose true multi-threaded, multi-core problems (data races, cache coherence, ...), since different threads would then genuinely be executing on different processors.

Having said that, if you still want to do it, it can be done in the following way. I am providing pseudocode for Windows, but the same thing could easily be done on Linux as well.

#define MAX_CORE 256
processor_mask[MAX_CORE] = {0};
core_number = 0;

Call GetLogicalProcessorInformation();
// From here we calculate core_number and also populate the processor_mask[]
// array, which is used later on to run different threads on different cores.

for (j = 0; j < THREAD_POOL_SIZE; j++)
    Call SetThreadAffinityMask(hThread[j], processor_mask[j % core_number]);
// hThread is the array of thread handles.
// If the number of threads is higher than the actual number of cores, the
// modulo (j % core_number) wraps the assignment back to the first core.

After the above routine is called, the threads would always be executing in the following manner:

Thread1-> Core1
Thread2-> Core2
Thread3-> Core3
Thread4-> Core4
Thread5-> Core5
Thread6-> Core6
Thread7-> Core7
Thread8-> Core8

Thread9-> Core1
Thread10-> Core2
...............

For more information, please refer to the manual/MSDN to learn more about these concepts.
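
A hedged C# sketch of the same round-robin idea (RoundRobinAffinity, the thread count and the work body are placeholders; each thread sets its own affinity mask via its pseudo-handle):

using System;
using System.Runtime.InteropServices;
using System.Threading;

static class RoundRobinAffinity
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();   // pseudo-handle for the calling thread

    [DllImport("kernel32.dll")]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr mask);

    static void Main()
    {
        int cores = Environment.ProcessorCount;
        int threadCount = 10;                   // demo value

        for (int j = 0; j < threadCount; j++)
        {
            int core = j % cores;               // wrap around once every core is taken
            var t = new Thread(() =>
            {
                Thread.BeginThreadAffinity();   // bind the managed thread to this OS thread
                SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << core));
                // ... do the actual work pinned to this core ...
                Thread.EndThreadAffinity();
            });
            t.Start();
        }
    }
}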

3
votes

You shouldn't have to worry about doing this yourself. I have multithreaded .NET apps running on dual quad-core machines, and no matter how the threads are started, whether via the ThreadPool or manually, I see a nice even distribution of work across all cores.

2
votes

Where each thread goes is generally handled by the OS itself... so generate 4 threads on a 4-core system and the OS will decide which core to run each one on, which will usually be one thread on each core.

2
votes

It is the operating system's job to split threads across different cores, and it will do so automatically when your threads are using a lot of CPU time. Don't worry about that. As for finding out how many cores your user has, try Environment.ProcessorCount in C#.
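
For example:

int cores = Environment.ProcessorCount;   // logical processors visible to the OS
Console.WriteLine("This machine reports " + cores + " logical processors.");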

2
votes

You cannot do this, as only the operating system has the privileges to do it. If applications decided it themselves, they would become much harder to write: each one would also have to take care of inter-processor communication and critical sections, and create its own semaphores or mutexes... whereas the operating system gives a common solution by doing it itself.

1
votes

One of the reasons you should not (as has been said) try to allocate this sort of stuff yourself is that you just don't have enough information to do it properly, particularly into the future with NUMA, etc.

If you have a thread ready to run, and there's a core idle, the kernel will run your thread; don't worry.