I am trying to understand how the load balancer works on a multiprocessor system in the Linux kernel.
The Linux scheduler uses runqueues to store the tasks that it has to run next. For a multiprocessor system, Robert Love's book Linux Kernel Development (2nd edition) explains how load_balance() is implemented as follows:
First, load_balance() calls find_busiest_queue() to determine the busiest runqueue. In other words, this is the runqueue with the greatest number of processes in it. If there is no runqueue that has 25% or more processes than the current, find_busiest_queue() returns NULL and load_balance() returns. Otherwise, the busiest runqueue is returned.
Second, load_balance() decides which priority array on the busiest runqueue it wants to pull from. The expired array is preferred because those tasks have not run in a relatively long time, thus are most likely not in the processor's cache (that is, they are not cache hot). If the expired priority array is empty, the active one is the only choice.
Next, load_balance() finds the highest priority (smallest value) list that has tasks, because it is more important to fairly distribute high priority tasks than lower priority ones.
Each task of the given priority is analyzed, to find a task that is not running, not prevented to migrate via processor affinity, and not cache hot. If the task meets this criteria, pull_task() is called to pull the task from the busiest runqueue to the current runqueue.
As long as the runqueues remain imbalanced, the previous two steps are repeated and more tasks are pulled from the busiest runqueue to the current. Finally, when the imbalance is resolved, the current runqueue is unlocked and load_balance() returns.
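To check my understanding of the steps quoted above, here is a minimal user-space sketch in C. It is not kernel code: `struct toy_rq`, `toy_find_busiest()` and `toy_load_balance()` are hypothetical names, a runqueue is reduced to a bare task count (no priority arrays, affinity, or cache-hot checks), and the stopping rule (queues differ by at most one task) is my own assumption about what "imbalance resolved" means.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified runqueue: only a task count. */
struct toy_rq {
    int nr_running;
};

/*
 * Return the index of the busiest queue, or -1 if no other queue has at
 * least 25% more tasks than queue `this_cpu` (the threshold in the quote).
 */
static int toy_find_busiest(const struct toy_rq *rqs, size_t n, size_t this_cpu)
{
    assert(this_cpu < n);

    int busiest = -1;
    int max_load = rqs[this_cpu].nr_running;

    for (size_t i = 0; i < n; i++) {
        if (i != this_cpu && rqs[i].nr_running > max_load) {
            max_load = rqs[i].nr_running;
            busiest = (int)i;
        }
    }
    if (busiest < 0)
        return -1;

    /* busiest * 4 >= current * 5  is  busiest >= 1.25 * current, in integers. */
    if (rqs[busiest].nr_running * 4 < rqs[this_cpu].nr_running * 5)
        return -1;

    return busiest;
}

/*
 * Pull tasks one at a time from the busiest queue to `this_cpu` until the
 * two queues differ by at most one task (assumed stopping rule).
 * Returns the number of tasks moved.
 */
static int toy_load_balance(struct toy_rq *rqs, size_t n, size_t this_cpu)
{
    int busiest = toy_find_busiest(rqs, n, this_cpu);
    int moved = 0;

    if (busiest < 0)
        return 0; /* already balanced, like the NULL return in the quote */

    while (rqs[busiest].nr_running - rqs[this_cpu].nr_running > 1) {
        rqs[busiest].nr_running--;   /* stands in for pull_task() */
        rqs[this_cpu].nr_running++;
        moved++;
    }
    return moved;
}
```

With queue lengths 2, 8 and 3 and CPU 0 as the current CPU, this sketch picks queue 1 as the busiest and moves three tasks, leaving both queues with five. With lengths 5 and 6 nothing is moved, because 6 is less than 25% above 5.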
The kernel code for load_balance() is as follows:
    static int load_balance(int this_cpu, runqueue_t *this_rq,
                            struct sched_domain *sd, enum idle_type idle)
    {
        struct sched_group *group;
        runqueue_t *busiest;
        unsigned long imbalance;
        int nr_moved;

        spin_lock(&this_rq->lock);

        group = find_busiest_group(sd, this_cpu, &imbalance, idle);
        if (!group)
            goto out_balanced;

        busiest = find_busiest_queue(group);
        if (!busiest)
            goto out_balanced;

        nr_moved = 0;
        if (busiest->nr_running > 1) {
            double_lock_balance(this_rq, busiest);
            nr_moved = move_tasks(this_rq, this_cpu, busiest,
                                  imbalance, sd, idle);
            spin_unlock(&busiest->lock);
        }
        spin_unlock(&this_rq->lock);

        if (!nr_moved) {
            sd->nr_balance_failed++;

            if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
                int wake = 0;

                spin_lock(&busiest->lock);
                if (!busiest->active_balance) {
                    busiest->active_balance = 1;
                    busiest->push_cpu = this_cpu;
                    wake = 1;
                }
                spin_unlock(&busiest->lock);
                if (wake)
                    wake_up_process(busiest->migration_thread);
                sd->nr_balance_failed = sd->cache_nice_tries;
            }
        } else
            sd->nr_balance_failed = 0;

        sd->balance_interval = sd->min_interval;

        return nr_moved;

    out_balanced:
        spin_unlock(&this_rq->lock);

        if (sd->balance_interval < sd->max_interval)
            sd->balance_interval *= 2;

        return 0;
    }
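One part of sd that I think I can follow is balance_interval. As far as I can tell from the code above, it is doubled (while below max_interval) whenever the domain is found balanced, and reset to min_interval whenever a busiest queue is found. Here is a small user-space sketch of just that logic. `struct toy_domain` and `toy_update_interval()` are hypothetical names, and only the three fields used above are modelled.

```c
/* Hypothetical stand-in for the three interval fields used in load_balance(). */
struct toy_domain {
    unsigned long min_interval;
    unsigned long max_interval;
    unsigned long balance_interval;
};

/*
 * Mirror the interval handling in load_balance():
 *  - found_busiest != 0: a busiest queue was found, so reset to min_interval;
 *  - found_busiest == 0: the out_balanced path, so back off by doubling
 *    while the interval is still below max_interval.
 */
static void toy_update_interval(struct toy_domain *sd, int found_busiest)
{
    if (found_busiest)
        sd->balance_interval = sd->min_interval;
    else if (sd->balance_interval < sd->max_interval)
        sd->balance_interval *= 2;
}
```

Starting from min_interval = 8 and max_interval = 32, three balanced passes in this sketch give intervals of 16, 32 and 32, and one pass that finds a busiest queue drops the interval back to 8.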
What I am not clear about is the struct sched_domain *sd parameter of load_balance() in the kernel code above. The structure is defined in include/linux/sched.h (http://lxr.linux.no/linux+v3.7.1/include/linux/sched.h#L895); it is a big structure, so I have only given a link for simplicity. What is struct sched_domain used for in load_balance()?
Why is it passed in when load_balance() is called, and what does this struct represent?
Some explanation seems to be given here: http://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt. Why does a CPU need scheduling domains? What do these domains stand for?