42
votes

I understand that a user can own a process and that each process has an address space (which contains the valid memory locations this process can reference). I know that a process can invoke a system call and pass parameters to it, just like any other library function. This seems to suggest that all system calls live in the process's address space, through shared memory or something similar. But perhaps this is only an illusion created by the fact that, in a high-level programming language, a system call looks like any other function when a process calls it.

But now let me go a step deeper and look more closely at what happens under the hood. How does the compiler compile a system call? Perhaps it pushes the system call name and the parameters supplied by the process onto a stack and then emits an assembly instruction, say "TRAP" or something similar; basically, the assembly instruction that raises a software interrupt.

This TRAP instruction is executed by the hardware, which first toggles the mode bit from user to kernel and then sets the code pointer to, say, the beginning of the interrupt service routines. From this point on, the ISR executes in kernel mode. It picks up the parameters from the stack (this is possible because the kernel has access to any memory location, even those owned by user processes), executes the system call, and in the end relinquishes the CPU, which toggles the mode bit again, and the user process resumes where it left off.

Is my understanding correct?

Attached is a rough diagram of my understanding: [image]

6 Answers

15
votes

Your understanding is pretty close; the trick is that most compilers will never emit system calls themselves, because the functions that programs call (e.g. getpid(2), chdir(2), etc.) are actually provided by the standard C library. The standard C library contains the code for the system call, whether it is made via INT 0x80 or SYSENTER. It'd be a strange program that makes system calls without a library doing the work. (Even though Perl provides a syscall() function that can directly make system calls! Crazy, right?)
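To make the wrapper idea concrete, here is a minimal sketch, assuming Linux with glibc. The names pid_via_wrapper and pid_via_raw_syscall are made up for this illustration; getpid(), syscall() and SYS_getpid are the real libc pieces. Both routes should reach the same kernel service, one through the dedicated wrapper and one through the generic syscall() function:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Usual route: the libc wrapper hides the trap instruction. */
pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* Generic route: pass the system call number to libc's syscall(),
 * which loads the registers and issues the trap for us. */
pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Calling both from main() and printing the results should show the same process ID.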

Next, the memory. The operating system kernel sometimes has easy address-space access to the user process memory. Of course, protection modes are different, and user-supplied data must be copied into the kernel's protected address space to prevent modification of user-supplied data while the system call is in flight:

static int do_getname(const char __user *filename, char *page)
{
    int retval;
    unsigned long len = PATH_MAX;

    if (!segment_eq(get_fs(), KERNEL_DS)) {
        if ((unsigned long) filename >= TASK_SIZE)
            return -EFAULT;
        if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
            len = TASK_SIZE - (unsigned long) filename;
    }

    retval = strncpy_from_user(page, filename, len);
    if (retval > 0) {
        if (retval < len)
            return 0;
        return -ENAMETOOLONG;
    } else if (!retval)
        retval = -ENOENT;
    return retval;
}

This, while it isn't a system call itself, is a helper function called by system call functions that copies filenames into the kernel's address space. It checks to make sure that the entire filename resides within the user's data range, calls a function that copies the string in from user space, and performs some sanity checks before returning.
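The return-value logic of that helper can be tried out in user space. This is only a sketch under stated assumptions: bounded_copy is a hypothetical stand-in for strncpy_from_user (it does none of the fault handling the kernel needs), sketch_getname is a made-up name, and SKETCH_PATH_MAX is an artificially small limit chosen to keep the example short:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PATH_MAX 16 /* hypothetical limit, far smaller than PATH_MAX */

/* Copy at most max bytes and return the string length, or max if no
 * terminating NUL was found within the limit. */
long bounded_copy(char *dst, const char *src, size_t max)
{
    size_t i;

    for (i = 0; i < max; i++) {
        dst[i] = src[i];
        if (src[i] == '\0')
            return (long)i;
    }
    return (long)max;
}

/* Mirrors the checks in the kernel helper: empty names and names that
 * fill the whole buffer are rejected. */
int sketch_getname(const char *name, char *page)
{
    long n = bounded_copy(page, name, SKETCH_PATH_MAX);

    if (n > 0) {
        if ((size_t)n < SKETCH_PATH_MAX)
            return 0;
        return -ENAMETOOLONG;
    }
    return -ENOENT;
}
```

The point of copying first and checking afterwards is that later code only ever looks at the private copy, so the caller cannot change the name behind its back.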

get_fs() and similar functions are remnants of Linux's x86 roots. The functions have working implementations on all architectures, but the names remain archaic.

All the extra work with segments is because the kernel and userspace might share some portion of the available address space. On a 32-bit platform (where the numbers are easy to comprehend), the kernel will typically have one gigabyte of virtual address space, and user processes will typically have three gigabytes of virtual address space.

When a process calls into the kernel, the kernel will 'fix up' the page table permissions to allow it access to the whole range, and gets the benefit of pre-filled TLB entries for user-provided memory. Great success. But when the kernel must context switch back to userspace, it has to flush the TLB to remove the cached privileges on kernel address space pages.

But the trick is, one gigabyte of virtual address space is not sufficient for all kernel data structures on huge machines. Maintaining the metadata of cached filesystems and block device drivers, networking stacks, and the memory mappings for all the processes on the system, can take a huge amount of data.

So different 'splits' are available: two gigs for user and two gigs for kernel, one gig for user and three gigs for kernel, and so on. As the space for the kernel goes up, the space for user processes goes down. There is also a 4:4 memory split that gives four gigabytes to the user process and four gigabytes to the kernel; with it, the kernel must fiddle with segment descriptors to be able to access user memory. The TLB is flushed on entering and exiting system calls, which is a pretty significant speed penalty. But it lets the kernel maintain significantly larger data structures.

The much larger page tables and address ranges of 64-bit platforms probably make all the preceding look quaint. I sure hope so, anyway.
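The arithmetic behind the simple splits can be sketched as follows. The macro and function names are made up for this illustration; the boundary values are the ones conventionally used for these splits on 32-bit x86. The 4:4 split is left out because it does not fit the single-boundary model used here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: each value is the first address that belongs
 * to the kernel under the given user:kernel split. */
#define ADDRESS_SPACE_32 (UINT64_C(1) << 32)
#define SPLIT_3G_1G UINT64_C(0xC0000000)
#define SPLIT_2G_2G UINT64_C(0x80000000)
#define SPLIT_1G_3G UINT64_C(0x40000000)

/* A user pointer is only plausible if it lies below the boundary. */
bool is_user_address(uint64_t addr, uint64_t task_size)
{
    return addr < task_size;
}

/* Whatever the user side does not get is left for the kernel. */
uint64_t kernel_bytes(uint64_t task_size)
{
    return ADDRESS_SPACE_32 - task_size;
}
```

With the 3:1 split, kernel_bytes(SPLIT_3G_1G) comes out at one gigabyte, which is the default case described earlier.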

9
votes

Yes, you've got it pretty much right. One detail, though: when the compiler compiles a system call, it will use the number of the system call rather than the name. For example, here is a list of Linux syscalls (for an old version, but the concept is still the same).

4
votes

You actually call the C runtime library. It's not the compiler that inserts TRAP; it's the C library that wraps TRAP in a library call. The rest of your understanding is correct.

3
votes

If you wanted to perform a system call directly from your program, you could easily do so. It is platform dependent, but let's say you wanted to read from a file. Every system call has a number. In this case you place the number of the read_from_file system call in register EAX. The arguments for the system call are placed in other registers or on the stack (depending on the system call). After the registers are filled with the correct data and you are ready to perform the system call, you execute the instruction INT 0x80 (this depends on the architecture). That instruction is an interrupt which causes control to pass to the OS. The OS then identifies the system call number in register EAX, acts accordingly and gives control back to the process that made the system call.

The way system calls are invoked is prone to change and depends on the given platform. By using libraries that provide easy interfaces to these system calls, you make your programs more platform independent, and your code will be much more readable and faster to write. Consider implementing system calls directly in a high-level language: you would need something like inline assembly to ensure the data are put in the right registers.
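As a rough sketch of the inline assembly route, assuming Linux and GCC or Clang: raw_getpid and raw_write are made-up names, and the register choices follow the Linux conventions for i386 (INT 0x80, arguments in EBX, ECX, EDX) and x86-64 (the SYSCALL instruction, arguments in RDI, RSI, RDX). On any other architecture the sketch falls back to libc's syscall() function.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* No arguments: only the system call number goes into EAX/RAX. */
long raw_getpid(void)
{
    long ret;
#if defined(__x86_64__)
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
#elif defined(__i386__)
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "a"((long)SYS_getpid)
                      : "memory");
#else
    ret = syscall(SYS_getpid); /* fallback: let libc issue the trap */
#endif
    return ret;
}

/* Three arguments: file descriptor, buffer and length. */
long raw_write(int fd, const void *buf, size_t len)
{
    long ret;
#if defined(__x86_64__)
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_write), "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");
#elif defined(__i386__)
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "a"((long)SYS_write), "b"((long)fd), "c"(buf), "d"(len)
                      : "memory");
#else
    ret = syscall(SYS_write, (long)fd, buf, len);
#endif
    return ret;
}
```

On failure the assembly paths return a negative errno value straight from the kernel, whereas the syscall() fallback returns -1 and sets errno. That difference is one more thing a library wrapper normally hides.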

3
votes

Normal programs usually do not "compile syscalls". For each syscall you usually have a corresponding userspace library function (usually implemented in libc on Unix-like systems). For example, the mkdir() function forwards its arguments to the mkdir syscall.

On GNU systems (I guess it's the same for others), the mkdir() function uses a syscall() function. The syscall function/macros are usually implemented in C. For example, have a look at INTERNAL_SYSCALL in sysdeps/unix/sysv/linux/i386/sysdep.h or syscall in sysdeps/unix/sysv/linux/i386/sysdep.S (glibc).

Now if you look at sysdeps/unix/sysv/linux/i386/sysdep.h, you can see that the call to the kernel is done by ENTER_KERNEL, which historically raised interrupt 0x80 on i386 CPUs. Now it calls a function (I guess it is implemented in linux-gate.so, which is a virtual SO file mapped by the kernel; it contains the most efficient way to make a syscall for your type of CPU).

0
votes

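Yes, your understanding is absolutely right: a C program can make a system call directly, and when that call happens it can go through a series of calls down to the assembly trap. I think your understanding can help a newbie immensely. Check this code, in which I call system(). Strictly speaking, system() is a C library function rather than a system call: it runs the command through a shell and makes several system calls along the way.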

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* system() is a C library function; it runs the command through a shell. */
    printf("Running ps with system()\n");
    system("ps ax");
    printf("Done.\n");
    return 0;
}
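Note that system() is provided by the C library, and the actual system calls happen inside it. Here is a rough sketch of the core of what it does, assuming a POSIX system with a shell at /bin/sh. run_command is a made-up name, and the signal handling and error cases that the real system() takes care of are left out:

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run cmd through the shell and return the wait status, or -1 if
 * the child could not be created or waited for. */
int run_command(const char *cmd)
{
    int status;
    pid_t pid = fork();              /* system call: create a child */

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: replace the process image with the shell. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                  /* reached only if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0) /* system call: wait for the child */
        return -1;
    return status;
}
```

The returned status can be inspected with WIFEXITED() and WEXITSTATUS(), in the same way as the return value of system().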