
I am trying to understand how the load balancer works on a multiprocessor system in the Linux kernel.

The Linux scheduler uses runqueues to store the tasks it has to run next. On a multiprocessor system, load_balance() is implemented as explained in Robert Love's book Linux Kernel Development, 2nd edition:

First, load_balance() calls find_busiest_queue() to determine the busiest runqueue. In other words, this is the runqueue with the greatest number of processes in it. If there is no runqueue that has 25% or more processes than the current, find_busiest_queue() returns NULL and load_balance() returns. Otherwise, the busiest runqueue is returned.

Second, load_balance() decides which priority array on the busiest runqueue it wants to pull from. The expired array is preferred because those tasks have not run in a relatively long time, thus are most likely not in the processor's cache (that is, they are not cache hot). If the expired priority array is empty, the active one is the only choice.

Next, load_balance() finds the highest priority (smallest value) list that has tasks, because it is more important to fairly distribute high priority tasks than lower priority ones.

Each task of the given priority is analyzed, to find a task that is not running, not prevented from migrating via processor affinity, and not cache hot. If the task meets these criteria, pull_task() is called to pull the task from the busiest runqueue to the current runqueue.

As long as the runqueues remain imbalanced, the previous two steps are repeated and more tasks are pulled from the busiest runqueue to the current. Finally, when the imbalance is resolved, the current runqueue is unlocked and load_balance() returns.

The code is as follows:

static int load_balance(int this_cpu, runqueue_t *this_rq,
                        struct sched_domain *sd, enum idle_type idle)
{
        struct sched_group *group;
        runqueue_t *busiest;
        unsigned long imbalance;
        int nr_moved;

        spin_lock(&this_rq->lock);

        group = find_busiest_group(sd, this_cpu, &imbalance, idle);
        if (!group)
                goto out_balanced;

        busiest = find_busiest_queue(group);
        if (!busiest)
                goto out_balanced;

        nr_moved = 0;
        if (busiest->nr_running > 1) {
                double_lock_balance(this_rq, busiest);
                nr_moved = move_tasks(this_rq, this_cpu, busiest,
                                      imbalance, sd, idle);
                spin_unlock(&busiest->lock);
        }
        spin_unlock(&this_rq->lock);

        if (!nr_moved) {
                sd->nr_balance_failed++;

                if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
                        int wake = 0;

                        spin_lock(&busiest->lock);
                        if (!busiest->active_balance) {
                                busiest->active_balance = 1;
                                busiest->push_cpu = this_cpu;
                                wake = 1;
                        }
                        spin_unlock(&busiest->lock);
                        if (wake)
                                wake_up_process(busiest->migration_thread);
                        sd->nr_balance_failed = sd->cache_nice_tries;
                }
        } else
                sd->nr_balance_failed = 0;

        sd->balance_interval = sd->min_interval;

        return nr_moved;

out_balanced:
        spin_unlock(&this_rq->lock);

        if (sd->balance_interval < sd->max_interval)
                sd->balance_interval *= 2;

        return 0; 
}

What I am not clear about is the struct sched_domain *sd in the above code. I checked and this structure is defined in include/linux/sched.h: http://lxr.linux.no/linux+v3.7.1/include/linux/sched.h#L895 (it is a big structure, so I have just given a link for simplicity). What I want to know is: what is the use of struct sched_domain in the above code?

Why is it used when load_balance() is called? What does this struct stand for?

A bit about this is given here: http://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt. Why does a CPU need scheduling domains? What do these domains stand for?


1 Answer


Scheduling domains and scheduler groups/cpu groups help ease the process of scheduling tasks such as:

  1. load balancing tasks across cpus.
  2. choosing a cpu for a new task to run on.
  3. choosing a cpu for a sleeping task to run when it wakes up.

This has a twofold advantage:

  1. It organises the cpus in the system into groups and hierarchies.

  2. It organises the cpus in a way that is useful. All cpus which
     share an l2 cache belong to one domain. All cpus which share an l3 cache
     belong to a higher-level domain, which encompasses all the domains that
     share the l2 cache.

The advantages that you see with a tree like data structure are similar here to the advantages of scheduler domains and groups.

Refer to the following diagram:

     _________sd1________
    /                    \
    ----------------------
         l3 cache
    ----------------------
    ---------   ----------
    l2 cache    l2 cache
    ---------   ----------
    cpu0 cpu1   cpu2 cpu3
    \_______/   \________/
      sd0          sd0

     ________sd1_________
    /                    \
    ----------------------
          l3 cache
    ----------------------
    ---------   ----------
    l2 cache    l2 cache
    ---------   ----------
    cpu4 cpu5   cpu6 cpu7
    \_______/   \________/
      sd0          sd0

What you see above is a scheduler domain hierarchy. sd1 encompasses the sd0s, which happen to be scheduler groups of sd1. Every cpu has a scheduler domain hierarchy associated with it. For example,
cpu0->sd = sd0; sd0->parent = sd1. This way, through a linked list, we can iterate through all the scheduler domains to which a cpu belongs.

How does this help?

1. Load balancing: Say cpu0 is idle and ready to pull tasks onto itself to relieve any other burdened cpu. In the above approach, it first checks whether the other cpus belonging to the first-level sched domain need to be relieved of load. Here, that is cpu1. If so, it takes on tasks from cpu1; else it goes up to the higher-level domain sd1. If it chooses to migrate a task from cpu1, that is the best outcome, because the cache contents can be reused (shared cache); there is no need to fetch from memory again. This is the first advantage: sched domains are formed based upon the advantages the hardware has to provide.

If it goes to sd1, then it probes sd1's 'groups', both the sd0s. Here is the next advantage: it needs information about the sched group alone and does not bother about the individual cpus in it. It checks whether load(sd0[cpu2,cpu3]) > load(sd0[cpu0,cpu1]). Only if this is true does it go on to see whether cpu2/cpu3 is more loaded. If there were no scheduler domains or groups, we would have to look at the states of cpu2 and cpu3 in two iterations instead of one iteration like we are doing now.

Now scale this problem and solution to 128 cpus! Imagine what a mess it would be if there were nothing to tell you which cpu would be the best to relieve load from; in the worst case you would have to iterate through all 128 cpus.

But with scheduler domains and groups, say you divide the 128 cpus into groups of 16 cpus; you would have 8 groups. See which is the busiest, so that would be 8 iterations; then you would know the busiest group and descend down, another 16 iterations. So the worst case is

8 + 16 = 24 iterations. And this decrease is with only one level of sched domains. Imagine if you had more levels; you would make the number of iterations even lower.

So in short, scheduler domains and groups are a 'divide and conquer, but conquer only what is most useful' solution to scheduling-related work.

I posted this in case someone might want to read it in the future.