An atomic operation is, classically, an op code that does a Test-and-Set. Basically this tests a value in memory and, if it is zero (for example), increments it. Whilst this is going on the CPU won't let any other core access that location, and the Test-and-Set is guaranteed to complete uninterrupted. So when called, the end result is either that the value has been incremented and your program goes down one branch, or it hasn't and your program goes down a different branch.
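As a rough illustration (using C++11's std::atomic rather than any particular CPU's op code; the names here are mine, purely for the example):

```cpp
#include <atomic>
#include <cstdio>

std::atomic<int> flag{0};

void tryToClaim()
{
    int expected = 0;
    // Atomically: "if flag is 0, make it 1". No other core can get in
    // between the test and the set.
    if (flag.compare_exchange_strong(expected, 1))
        std::printf("we got it - take the 'winner' branch\n");
    else
        std::printf("someone else got there first - take the other branch\n");
}
```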
Not all CPUs have one of these - the 68000 family did, the PowerPC family did not (I think - corrections welcome; I know it was a right nuisance in PowerPC VME systems, whereas the 68000-based previous generation of machines could do a test-and-set on remote boards), and I'm pretty sure earlier x86s didn't either. As far as I know, all major modern CPUs do - it's very useful.
Effectively a Test-and-Set gives you a counting semaphore, and that is what they're used for. However, with only a little bit of chicanery in a library it can also be used as a mutex (which is a binary semaphore that can be given only by the thread that took it).
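To give a flavour of that chicanery, here's a bare-bones lock built on std::atomic_flag (which is essentially a portable test-and-set); a real library adds the bookkeeping that makes it a proper mutex (ownership, priority handling, back-off and so on), but the core is just this:

```cpp
#include <atomic>

class SpinLock
{
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void take()
    {
        // test_and_set returns the previous value, so keep trying until we
        // were the one that flipped it from clear to set.
        while (locked.test_and_set(std::memory_order_acquire))
        {
            // spin - note the kernel never hears about any of this
        }
    }

    void give()
    {
        locked.clear(std::memory_order_release);
    }
};
```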
AFAIK semaphores and mutexes are implemented these days by making use of the Test-and-Set op codes available on the CPU. However, on platforms where there is no Test-and-Set op code, its behaviour has to be synthesised by the OS, probably involving an ISR, interrupt disabling, and so forth. The end result behaves the same, but it's considerably slower. Also, on these platforms an "atomic" has to be synthesised using mutexes to guard the value.
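Roughly speaking, the synthesised version ends up with the shape of this sketch (a compiler or OS would do it at a lower level, but the idea is the same):

```cpp
#include <mutex>

// A pretend "atomic" counter for a platform with no suitable op code:
// every operation has to be guarded by a mutex instead, which is much
// slower than a single uninterruptible instruction.
class SynthesisedAtomicInt
{
    std::mutex guard;
    int value = 0;
public:
    int fetchAdd(int amount)
    {
        std::lock_guard<std::mutex> lock(guard);
        int old = value;
        value += amount;
        return old;
    }
};
```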
So I suspect that talk of mutexes serialising at the kernel level is referring to systems where a mutex has been implemented by the kernel, and the atomic operations are supported by the CPU.
It's also worth remembering that calls to take / give mutexes involve the kernel making scheduling decisions, even if the OS then goes on to use a CPU test-and-set op code to implement the mutual exclusion part of the mutex. Whereas calling a test-and-set op code directly from within your program does not; the kernel has no idea it's even happened. So a mutex is a good way to ensure that high priority threads run first if there is contention, whereas a test-and-set op code likely is not (it will be a first come, first served thing). That's because the CPU has no concept of thread priority; that's an abstract concept dreamed up by the OS developers.
You can learn a lot about how this kind of thing is done by rooting around inside the source code of the Boost C++ library. Things like shared pointers depend on mutual exclusion, and Boost can implement that mutual exclusion in a number of different ways: using test-and-set style op-codes on platforms that have them, or using the POSIX mutex library calls, or, if you tell it there is only one thread in your program, not bothering at all.
It's worthwhile for Boost to implement its own mutual exclusion mechanisms using op-codes where it can; it doesn't need mutual exclusion to work inter-process (just inter-thread), whereas a full-on POSIX mutex can be inter-process - overkill for Boost's requirements.
With Boost you can override the default selection using a few #defines. So you can speed up a single-threaded program by getting it compiled without the mutual exclusion in shared pointers, which is occasionally genuinely useful. What I don't know is whether that's been lost in C++11 and onwards, now that the standard has absorbed smart pointers and made them its own.
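If memory serves, the knobs look something like this (check the Boost smart_ptr documentation for your version before relying on them):

```cpp
// From memory - define one of these before any Boost headers are included
// (e.g. on the compiler command line) to override the default choice:

#define BOOST_SP_DISABLE_THREADS   // single-threaded program: no locking at all
//#define BOOST_SP_USE_PTHREADS    // force use of POSIX mutexes
//#define BOOST_SP_USE_SPINLOCK    // force the atomic / spin-lock implementation

#include <boost/shared_ptr.hpp>
```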
EDIT
It's also worth taking a look at futexes, which is what Linux uses as the underpinnings for mutexes, semaphores, etc. The idea of a futex is to use atomic operations to implement the bulk of the functionality entirely in user space, resorting to system calls only when absolutely necessary. The result is that, so long as there's not too much contention, a high level thing like a mutex or semaphore is very much more efficient than in the bad old days when they always resulted in a system call. Futexes have been around in Linux since about 2003, so we've had the benefit of them for 15 years now. Basically there's no point worrying too much about the efficiency of mutexes vs atomic operations - they're not too far off being the same thing.
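A very rough, Linux-specific sketch of the fast-path / slow-path idea, loosely following the three-state design in Ulrich Drepper's "Futexes Are Tricky" (error handling and memory-ordering subtleties glossed over):

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

std::atomic<int> state{0};   // 0 = free, 1 = taken, 2 = taken with waiters

static long futex(void *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

void take()
{
    int c = 0;
    // Fast path: uncontended - one atomic op, the kernel never gets involved.
    if (state.compare_exchange_strong(c, 1))
        return;

    // Slow path: mark the lock contended and sleep in the kernel until the
    // holder wakes us, then try again.
    if (c != 2)
        c = state.exchange(2);
    while (c != 0)
    {
        futex(&state, FUTEX_WAIT, 2);
        c = state.exchange(2);
    }
}

void give()
{
    // If nobody was waiting (state was 1), this too is just one atomic op.
    if (state.exchange(0) == 2)
        futex(&state, FUTEX_WAKE, 1);
}
```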
What's likely more important is to aim for clean, tidy source code that's easy to read, and to use the library calls that help with that. Using, say, atomic operations over mutexes at the expense of simple source code is likely not worth it. Certainly on platforms like VxWorks, which don't really have the concept of kernel / user space in the first place and are engineered around lightning-fast context switch times, one can afford to be profligate with the use of mutexes and semaphores to achieve simplicity.
For example, using a mutex to control which thread has access to a particular network socket is a way of using the kernel and thread priorities to manage the priorities of the different types of message being sent through that socket. The source code is beautifully simple - threads merely take / give the mutex around using the socket, and that's all there is. No queue manager, no prioritisation decision making code, nothing. All of that is done by the OS scheduling threads in response to mutex takes / gives. On VxWorks this ended up being pretty efficient, benefited from the OS resolving priority inversions, and took very little time to develop. On Linux, especially one with the PREEMPT_RT patch set applied and running as real time priority threads, it's also not too bad (because that also resolves priority inversions, something that I gather Linus doesn't much care for). Whereas on an OS that doesn't have futexes underpinning mutexes and also has expensive context switch times (e.g. Windows), it would be inefficient.
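In outline the pattern is no more than this sketch (the socket setup and message format are made up for illustration):

```cpp
#include <mutex>
#include <cstddef>
#include <sys/socket.h>

std::mutex socketMutex;   // one mutex guarding one socket
int sock = -1;            // assume this gets connected somewhere else

// Every sender thread, whatever its priority, does only this:
void sendMessage(const void *msg, size_t len)
{
    std::lock_guard<std::mutex> lock(socketMutex);
    send(sock, msg, len, 0);
    // Which high-priority sender gets in first under contention is the
    // scheduler's problem, not this code's.
}
```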