
By looking at the scheduling stats in /proc/<PID>/sched, you can see output like this:

[horro@system ~]$ cat /proc/1/sched
systemd (1, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :    2499611106.982616
se.vruntime                                  :          7952.917943
se.sum_exec_runtime                          :         58651.279127
se.nr_migrations                             :                53355
nr_switches                                  :               169561
nr_voluntary_switches                        :               168185
nr_involuntary_switches                      :                 1376
se.load.weight                               :              1048576
se.avg.load_sum                              :               343837
se.avg.util_sum                              :               338827
se.avg.load_avg                              :                    7
se.avg.util_avg                              :                    7
se.avg.last_update_time                      :     2499611106982616
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                  180
mm->numa_scan_seq                            :                    1
numa_pages_migrated                          :                  296
numa_preferred_nid                           :                    0
total_numa_faults                            :                   34
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=23 group_private=0 group_shared=0
numa_faults node=1 task_private=0 task_shared=0 group_private=0 group_shared=0
numa_faults node=2 task_private=0 task_shared=0 group_private=0 group_shared=0
numa_faults node=3 task_private=0 task_shared=11 group_private=0 group_shared=0
numa_faults node=4 task_private=0 task_shared=0 group_private=0 group_shared=0
numa_faults node=5 task_private=0 task_shared=0 group_private=0 group_shared=0
numa_faults node=6 task_private=0 task_shared=0 group_private=0 group_shared=0
numa_faults node=7 task_private=0 task_shared=0 group_private=0 group_shared=0

I have been trying to figure out what the differences between migrations and switches are; there are some responses here and here. Summarizing these responses:

  • nr_switches: total number of context switches.
  • nr_voluntary_switches: number of voluntary switches, i.e. the thread blocked (for example, waiting on I/O) and another thread was picked to run.
  • nr_involuntary_switches: number of times the scheduler preempted the thread because another runnable thread was ready to run (both counters are exercised in the small demo below).
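
To watch the voluntary counter move, here is a minimal sketch of my own (not from any kernel documentation) that reads nr_voluntary_switches from /proc/self/sched before and after a loop of short sleeps; each sleep blocks the thread, so the counter should grow by roughly the number of sleeps:

/* Demo sketch: count voluntary switches caused by blocking.
 * Reads nr_voluntary_switches from /proc/self/sched before and
 * after 100 short sleeps. Build with: gcc -o demo demo.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long read_sched_field(const char *field)
{
    FILE *f = fopen("/proc/self/sched", "r");
    char line[256];
    long val = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, field, strlen(field)) == 0) {
            sscanf(line, "%*s : %ld", &val);   /* "name : value" layout */
            break;
        }
    }
    fclose(f);
    return val;
}

int main(void)
{
    long before = read_sched_field("nr_voluntary_switches");

    for (int i = 0; i < 100; i++)
        usleep(1000);   /* each sleep blocks, forcing a voluntary switch */

    long after = read_sched_field("nr_voluntary_switches");
    printf("voluntary switches during the sleeps: %ld\n", after - before);
    return 0;
}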

So what are migrations, then? Are these concepts related or not? Do migrations happen between cores while switches happen within a core?


1 Answer


Migration is when a thread, usually after a context switch, gets scheduled on a different CPU than the one it ran on before.

EDIT 1:

Here is more info on Wikipedia about migration: https://en.wikipedia.org/wiki/Process_migration

Here is the kernel code increasing the counter: https://github.com/torvalds/linux/blob/master/kernel/sched/core.c#L1175

if (task_cpu(p) != new_cpu) {
    ...
    p->se.nr_migrations++;   /* counted only when the CPU actually changes */
    ...
}

EDIT 2:

A thread can migrate to another CPU in the following cases:

  1. During exec().
  2. During fork().
  3. During a thread wake-up.
  4. If the thread's affinity mask has changed (sketched below).
  5. When the current CPU is going offline.
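
For instance, case 4 can be demonstrated with a small program of my own (assuming at least two online CPUs) that flips its own affinity mask between CPU 0 and CPU 1; watching se.nr_migrations in /proc/<pid>/sched while it runs should show the counter climbing:

/* Sketch of case 4: alternating the affinity mask between two CPUs
 * forces a migration on each change. Assumes CPUs 0 and 1 are online. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    printf("pid %d: toggling affinity between CPU 0 and CPU 1\n", getpid());
    for (int i = 0; i < 10; i++) {
        CPU_ZERO(&set);
        CPU_SET(i % 2, &set);   /* alternate between CPU 0 and CPU 1 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        usleep(100000);   /* give the thread time to run on the new CPU */
    }
    return 0;
}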

For more info, please have a look at the functions set_task_cpu(), move_queued_task(), and migrate_tasks() in the same source file: https://github.com/torvalds/linux/blob/master/kernel/sched/core.c

The policy the scheduler follows is implemented in select_task_rq(), and it depends on the scheduling class in use. The basic version of the policy:

if (p->nr_cpus_allowed > 1)
    /* several CPUs allowed: let the scheduling class choose */
    cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
else
    /* pinned to a single CPU: there is nothing to choose */
    cpu = cpumask_any(&p->cpus_allowed);

Source: https://github.com/torvalds/linux/blob/master/kernel/sched/core.c#L1534

So, to avoid migrations, pin your threads by setting their CPU affinity mask using the sched_setaffinity(2) system call or the corresponding POSIX API, pthread_setaffinity_np(3).
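
A minimal sketch of the pthread variant (my own example, pinning the calling thread to CPU 0; build with gcc -pthread):

/* Pin the calling thread to CPU 0 so the scheduler cannot migrate it.
 * After this call, se.nr_migrations for the thread should stop growing. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* allow CPU 0 only */

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
        return 1;
    }

    printf("pinned to CPU 0\n");
    return 0;
}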

Here is the definition of select_task_rq() for the Completely Fair Scheduler: https://github.com/torvalds/linux/blob/master/kernel/sched/fair.c#L5860

The logic is fairly involved, but in essence it either selects an idle sibling CPU or finds the least busy one.
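
As a toy model (my own simplification, not the kernel code), the idea boils down to something like this: take the first idle CPU if there is one, otherwise fall back to the least loaded:

/* Toy model of the placement idea: prefer an idle CPU, otherwise
 * pick the least loaded one. Loads here are made-up numbers. */
#include <stdio.h>

#define NCPUS 4

static int pick_cpu(const int load[NCPUS])
{
    int best = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (load[cpu] == 0)
            return cpu;   /* idle CPU: take it immediately */
        if (load[cpu] < load[best])
            best = cpu;   /* remember the least busy so far */
    }
    return best;
}

int main(void)
{
    int load[NCPUS] = { 3, 1, 2, 4 };   /* hypothetical per-CPU loads */

    printf("picked CPU %d\n", pick_cpu(load));
    return 0;
}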

Hope this answers your question.