On a multi-core Linux machine, when will the process scheduler migrate a process to another CPU?

Question

My program has an RSS of 65 GB. When it calls fork, sys_clone->dup_mm->copy_page_range takes more than 2 seconds. During that time one CPU is at 100% sys executing the fork, and one of my threads cannot get any CPU time until the fork finishes. The machine has 16 CPUs, and the other CPUs are idle.

So my question is: if one CPU is busy with the fork, why doesn't the scheduler migrate the thread waiting on that CPU to another, idle CPU? In general, when and how does the scheduler migrate processes between CPUs?

I searched this site, and the existing threads do not answer my question.

Why do you think that this other thread is starving for the CPU? Maybe it is sleeping on some resource/memory lock that is unavailable during fork. — oakad
Sorry, my description was not clear. The thread waiting for the CPU is my I/O thread, which sends and receives packets to and from clients. In my observation, packets are always pending, but the I/O thread cannot receive them. With the help of SystemTap, I found that the I/O thread cannot get CPU time. I can see one CPU at 100% sys (in sys_clone) while, at the same time, the other CPUs are idle. — Raymond
sys_clone may hold some kernel lock while doing dup_mm (pi_lock or mm->mmap_sem), and your I/O thread may need to take the same mutex/semaphore. Try to get a stack trace (kernel or user) for the second thread... (Are you sure that only fork will work for you? Try vfork+exec (= posix_spawn), if there is an exec just after the fork.) — osgx

osgx · Accepted Answer · 2014-05-03T03:15:11

rss is 65G, when call fork, sys_clone->dup_mm->copy_page_range will consume more than 2 seconds

While doing fork (or clone), the VMAs of the existing process must be copied into the VMAs of the new process. The dup_mm function (kernel/fork.c) creates the new mm and does the actual copy. It contains no direct calls to copy_page_range, but I think the static function dup_mmap may be inlined into dup_mm, and dup_mmap does call copy_page_range.

In dup_mmap several locks are taken, in both the new mm and the old oldmm:

    down_write(&oldmm->mmap_sem);

After the mmap_sem reader/writer semaphore is taken, a loop walks all of the mappings (VMAs) to copy their metadata:

    for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next)

Only after the loop (which is long in your case) is mmap_sem released:

out:
    /* ... */
    up_write(&oldmm->mmap_sem);

While the mmap_sem rwlock is held by a writer, no other reader or writer can do anything with the mappings in oldmm.

one thread cannot get cpu time until fork finish So my question is one cpu was busy on fork, why the scheduler don't migrate the process waiting on this cpu to other idle cpu?

Are you sure that the other thread is ready to run and is not trying to do anything with the mappings, such as:

mapping something new or unmapping something no longer needed,
growing or shrinking its heap (brk),
growing its stack,
taking a page fault,
or many other activities...?

Actually, the wait-cpu thread is my IO thread, which send/receive package from client, in my observation, the package always exist, but the IO thread cannot receive it.

You should check the stack of your wait-cpu thread (there is even a SysRq for this) and the kind of I/O it does. Memory-mapping a file is one kind of I/O that will be blocked on mmap_sem by fork.

You can also check the "last used CPU" of the wait-cpu thread, e.g. in the top monitoring utility, by enabling the thread view (H key) and adding the "Last used CPU" column to the output (f, then j in older versions; f, scroll to P, Enter in newer ones). I think it is possible that your wait-cpu thread was already on another CPU, just not allowed (not ready) to run.
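The same information that top shows is available from /proc/<pid>/task/<tid>/stat, where, per proc(5), field 3 is the thread state and field 39 is the CPU the thread last ran on. Below is a minimal sketch of a parser for one such line; the function name parse_task_stat is made up for the illustration. A state of D (uninterruptible sleep) rather than R (runnable) during the fork would suggest the thread is waiting on a lock, not on the scheduler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parses one line of /proc/<pid>/task/<tid>/stat.
 * Stores the state character (field 3) in *state and the last-used CPU
 * (field 39) in *last_cpu. Returns 0 on success, -1 if the line is too short.
 * Hypothetical helper, written for illustration only. */
int parse_task_stat(const char *line, char *state, int *last_cpu)
{
    assert(line && state && last_cpu);

    /* The command name (field 2) is in parentheses and may itself contain
     * spaces or ')', so start scanning after the LAST ')'. */
    const char *p = strrchr(line, ')');
    if (!p)
        return -1;
    p++;
    while (*p == ' ')
        p++;
    if (*p == '\0')
        return -1;
    *state = *p;                  /* field 3: R, S, D, ... */

    int field = 3;
    while (field < 39) {
        while (*p && *p != ' ')   /* skip the current field */
            p++;
        while (*p == ' ')         /* skip the separator */
            p++;
        if (*p == '\0')
            return -1;
        field++;
    }
    *last_cpu = atoi(p);          /* field 39: CPU last executed on */
    return 0;
}
```

To use it, read the stat file of the I/O thread with fgets a few times while the fork is in progress and pass each line to parse_task_stat.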

If you are using fork only to call exec, it can be useful to:

either switch to vfork+exec (or just to posix_spawn). vfork suspends your process (but may not suspend your other threads, which is dangerous) until the new process calls exec or exits, but the exec may be faster than waiting for 65 GB of mappings to be copied.
or avoid forking from a multithreaded process with several active threads and multi-GB virtual memory. You can create a small helper process (without the multi-GB mappings), communicate with it using IPC, sockets or pipes, and ask it to fork and do everything you want.
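As an illustration of the first option, here is a minimal sketch that starts a program with posix_spawnp and waits for it. The helper name spawn_and_wait is made up for this example. It assumes the child only needs to exec another program; whether posix_spawn avoids copying the page tables depends on the libc implementation (recent glibc versions are believed to use a vfork-like clone), so measure it on your system.

```c
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Runs argv[0] (looked up in PATH) via posix_spawnp() and waits for it.
 * Returns the child's exit status, or -1 if the spawn or wait failed or the
 * child did not exit normally. Hypothetical helper, for illustration only. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid;
    int status;

    assert(argv && argv[0]);

    /* No file actions and no spawn attributes: inherit everything as-is. */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, calling it with the argument vector {"sh", "-c", "exit 3", NULL} should return 3.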
