15
votes

I recently encountered an issue in a custom Linux kernel (2.6.31.5, x86) driver where copy_to_user would periodically not copy any bytes to user space. It would return the count of bytes passed to it, indicating that it had not copied anything. After code inspection we found that the code was disabling interrupts while calling copy_to_user which violates it's contract. After correcting this, the issue stopped occurring. Because the issue happened so infrequently, I need to prove that disabling the interrupts caused the issue.

If you look at the code snippet below from arch/x86/lib/usercopy_32.c rep; movsl copies the words to userspace by the count in CX. Size is updated with CX on exit. CX will be 0 if the movsl execute correctly. Because CX is not zero, the movs? instructions must not have executed, in order to fit the definition of copy_to_user and the observed behavior.

/* Generic arbitrary sized copy.  */
#define __copy_user(to, from, size)                 \
do {                                    \
    int __d0, __d1, __d2;                       \
    __asm__ __volatile__(                       \
        "   cmp  $7,%0\n"                   \
        "   jbe  1f\n"                  \
        "   movl %1,%0\n"                   \
        "   negl %0\n"                  \
        "   andl $7,%0\n"                   \
        "   subl %0,%3\n"                   \
        "4: rep; movsb\n"                   \
        "   movl %3,%0\n"                   \
        "   shrl $2,%0\n"                   \
        "   andl $3,%3\n"                   \
        "   .align 2,0x90\n"                \
        "0: rep; movsl\n"                   \
        "   movl %3,%0\n"                   \
        "1: rep; movsb\n"                   \
        "2:\n"                          \
        ".section .fixup,\"ax\"\n"              \
        "5: addl %3,%0\n"                   \
        "   jmp 2b\n"                   \
        "3: lea 0(%3,%0,4),%0\n"                \
        "   jmp 2b\n"                   \
        ".previous\n"                       \
        ".section __ex_table,\"a\"\n"               \
        "   .align 4\n"                 \
        "   .long 4b,5b\n"                  \
        "   .long 0b,3b\n"                  \
        "   .long 1b,2b\n"                  \
        ".previous"                     \
        : "=&c"(size), "=&D" (__d0), "=&S" (__d1), "=r"(__d2)   \
        : "3"(size), "0"(size), "1"(to), "2"(from)      \
        : "memory");                        \
} while (0)

The 2 ideas that I have are:

  1. when the interrupts are disabled, the page fault does not occur and then rep; movs? is skipped without doing anything. The return value would then be CX, or the amount not copied to userspace, as the definition specifies and the behavior observed.
  2. The page fault does occur, but linux can not process it because interrupts are disabled, so the page fault handler skips the instruction, although I don't know how the page fault handler would do this. Again, in this case CX would remain unmodified and the return value would be correct.

Can anyone point me to the sections in the Intel manuals that specify this behavior, or point me to any additional Linux source that could be helpful?

2
you mention that "the code was disabling interrupts". Can you elaborate which interrupts and how?...TheCodeArtist
@TheCodeArtist: write_lock_bh(); was held, which by my understanding disables software interrupts.Edward
@TheCodeArtist: Thanks! your comment made me look into write_lock_bh() much more closely, showing me the way!Edward

2 Answers

9
votes

I've found the answer. My #2 suggestion was correct and the mechanism was right in front of my face. The page fault does happen, but the fixup_exception mechanism is used to provide a exception/continue mechanism. This section adds entries to the exception handler table:

    ".section __ex_table,\"a\"\n"               \
    "   .align 4\n"                 \
    "   .long 4b,5b\n"                  \
    "   .long 0b,3b\n"                  \
    "   .long 1b,6b\n"                  \
    ".previous"                     \

This says: if the IP address is the first entry and an exception is encountered in a fault handler, then set the IP address to the second address and continue.

So if the exception happens at "4:", jump to "5:". If the exception happens at "0:" then jump to "3:" and if the exception happens at "1:" jump to "6:".

The missing piece is in do_page_fault() in arch/x86/mm/fault.c:

/*
 * If we're in an interrupt, have no user context or are running
 * in an atomic region then we must not take the fault:
 */
if (unlikely(in_atomic() || !mm)) {
    bad_area_nosemaphore(regs, error_code, address);
    return;
}

in_atomic returned true because we are in a write_lock_bh() lock! bad_area_nosemaphore eventually does the fixup.

If a page_fault would occur (which was unlikely, because of the concept of the working space) then the function call would fail and jump out of the __copy_user macro, with the uncopied bytes set to size because preemption was disabled.

4
votes

Page faults are not mask-able interrupts. In fact, they are not technically interrupts at all - but rather exceptions, although I agree the difference is more semantic.

The reason your copy_to_user failed when you called it in atomic context with interrupts disabled is because the code has an explicit check for this.

See http://lxr.free-electrons.com/source/arch/x86/lib/usercopy_32.c#L575