rss is 65G, when call fork, sys_clone->dup_mm->copy_page_range will consume more than 2 seconds
During fork (or clone), the VMAs of the existing process must be copied into the new process. The dup_mm function (kernel/fork.c) creates the new mm and does the actual copy. It contains no direct calls to copy_page_range, but I think the static function dup_mmap may be inlined into dup_mm, and dup_mmap does call copy_page_range.
dup_mmap takes several locks, in both the new mm and the old oldmm:
down_write(&oldmm->mmap_sem);
After the mmap_sem reader/writer semaphore is taken for writing, there is a loop over all mappings (VMAs) of the old mm that copies their metadata; copy_page_range is called from inside this loop:
for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next)
Only after the loop finishes (and it is long in your case) is mmap_sem unlocked:
out:
...
up_write(&oldmm->mmap_sem);
While the mmap_sem rwlock is held by a writer, no other reader or writer can do anything with the mappings in oldmm.
one thread cannot get cpu time until fork finish
So my question is one cpu was busy on fork, why the scheduler don't migrate the process waiting on this cpu to other idle cpu?
Are you sure that the other thread is ready to run and is not trying to do anything with its mappings, such as:
- mmaping something new or unmapping something not needed,
- growing or shrinking its heap (brk),
- growing its stack,
- pagefaulting
- or many other activities...?
Actually, the wait-cpu thread is my IO thread, which send/receive package from client, in my observation, the package always exist, but the IO thread cannot receive it.
You should check the stack of your wait-cpu thread (there is even a SysRq for this) and the kind of I/O it does. mmap-ing a file is one variant of I/O that will be blocked on mmap_sem by fork.
Also, you can check the "last used CPU" of the wait-cpu thread, e.g. in the top monitoring utility, by enabling the thread view (H key) and adding the "Last used CPU" column to the output (f j in older versions; f, scroll to P, Enter in newer ones). I think it is possible that your wait-cpu thread was already on another CPU, just not allowed (not ready) to run.
If you are using fork only to call exec, it can be useful to:
- either switch to vfork+exec (or just to posix_spawn). vfork will suspend your process until the new process does exec or exit (but it may not suspend your other threads, which is dangerous); still, exec-ing may be faster than waiting for 65 GB of mappings to be copied.
- or avoid calling fork from a multithreaded process with several active threads and multi-GB virtual memory. You can create a small helper process (without the multi-GB mappings), communicate with it using IPC, sockets or pipes, and ask it to fork and do everything you want.
sys_clone may block some kernel mutex while doing dup_mm (pi_lock or mm->mmap_sem), and your I/O thread needs to lock the same mutex/semaphore. Try to get the stack (kernel or user) of the second thread... (Are you sure that only fork may work for you? Try vfork+exec = posix_spawn, if there is an exec just after fork.) - osgx