CPU: Intel Sandy / Ivy Bridge (x86_64)
I need to write a driver for a device that is connected to the CPU via PCI Express, and I need to use the maximum bandwidth. To do this, I map the device memory into the physical address space of the processor, and then map this memory into the virtual address space of the kernel, marked as WC (Write Combined), using ioremap_wc().
As far as I know, an x86_64 CPU has several buffers:
- Cache - the well-known fast memory buffer, consisting of three levels: L1 / L2 / L3. Each level is made up of cache lines of 64 bytes.
- In WB (Write Back) mode (asynchronous), the CPU writes data from the cache to RAM in the background, in blocks of 64 bytes and in any order.
- In WT (Write Through) mode (synchronous), each store to memory with MOV [addr], reg writes the cache line to the cache and to RAM immediately.
Cache levels in detail: each core has its own L1 (64 KB, 1 ns) and L2 (256 KB, 3 ns), and the whole CPU has one L3 shared by all cores (4 - 40 MB, 10 ns).
- Store Buffer (SB) - a buffer (queue) in which all stores are placed sequentially. The data is then lazily written to memory in the background, in the same order. The data can be forced out of the store buffer to the cache / RAM by using SFENCE or MFENCE (for example, to support sequential consistency between cores).
- BIU (Bus Interface Unit) / WCB (Write Combining Buffers) - used in WC (Write Combined) mode. When a memory region is marked as WC, the cache is not used; instead a BIU / WCB is used, which is 64 bytes in size, the same as a cache line. When we store to such memory with MOV [addr], reg one byte at a time, 64 times, the whole BIU / WCB is written to memory only after the last byte has been stored. This mechanism is optimized for writing data to a memory area in whole blocks of 64 bytes. For example, it is very important for storing data to device memory that is mapped into the CPU physical address space through the PCI Express interface, where sending 64 bytes at a time gives several times the actual bandwidth of sending 1 byte at a time. The data can be forced out of the BIU / WCB to the [remote] memory by using SFENCE or MFENCE.
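To make the "whole blocks of 64 bytes" idea concrete, here is a minimal user-space sketch of how I understand it, not driver code. It copies data with SSE2 non-temporal stores (_mm_stream_si128), four 16-byte stores per 64-byte block, and then issues SFENCE. The function name stream_copy_64 and the WC_LINE_BYTES constant are my own, hypothetical names. In a real driver the destination would be the ioremap_wc() mapping and the kernel's own I/O accessors would apply, so treat this only as an illustration of the access pattern.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_stream_si128; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define WC_LINE_BYTES 64 /* assumed size of one write-combining buffer */

/*
 * Copy len bytes from src to dst using non-temporal (streaming) stores,
 * one full 64-byte block per loop iteration, then issue SFENCE.
 * Preconditions: dst is 16-byte aligned (64-byte alignment is better for
 * filling whole blocks) and len is a multiple of 64.
 */
static void stream_copy_64(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(len % WC_LINE_BYTES == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i += 4) {
        /* Unaligned loads from the source are fine. */
        __m128i a = _mm_loadu_si128(s + i + 0);
        __m128i b = _mm_loadu_si128(s + i + 1);
        __m128i c = _mm_loadu_si128(s + i + 2);
        __m128i e = _mm_loadu_si128(s + i + 3);

        /* Four back-to-back 16-byte streaming stores cover one 64-byte block. */
        _mm_stream_si128(d + i + 0, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

For example, calling stream_copy_64(dst, src, 128) with a 64-byte-aligned dst writes two full 64-byte blocks. On ordinary WB memory this only shows the pattern; whether each block really leaves the core as a single 64-byte write depends on the memory type and on the hardware.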
Some closely related questions:
1. Do the Cache, Store Buffer and BIU/WCB all use the same physical buffer in the CPU, just different parts of it, or does each of them have its own separate physical buffer in the CPU?
2. If the Cache and BIU use the same physical buffer, for example both use parts of the L1 cache, then why do SFENCE/MFENCE have an impact on the second but not on the first? And if they have separate physical buffers, then why do a cache line and a BIU have the same size of 64 bytes?
3. The number of cache lines is (65536 / 64) = 1024 for L1, (262144 / 64) = 4096 for L2, and 4 MB / 64 bytes for L3. The size of the Store Buffer we don't know. But how many BIUs / WCBs (64 bytes each) are there on a single CPU core, or on the whole CPU?
4. As we can see, the SFENCE and MFENCE instructions have an impact on the Store Buffer and on the BIU / WCB. But do these instructions have any impact on the Cache (L1/L2/L3)?
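For reference, the cache line arithmetic from question 3 can be written as a tiny C helper. The function name cache_line_count is hypothetical, and the sizes are the ones assumed in this post (64 KB L1, 256 KB L2, 4 MB as the smallest L3, 64-byte lines), not values queried from the CPU.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of cache lines in a cache of cache_bytes bytes with lines of
 * line_bytes bytes. The cache size must be a whole multiple of the line size.
 */
static size_t cache_line_count(size_t cache_bytes, size_t line_bytes)
{
    assert(line_bytes != 0 && cache_bytes % line_bytes == 0);
    return cache_bytes / line_bytes;
}
```

With these assumptions, cache_line_count(65536, 64) gives 1024 lines for L1, cache_line_count(262144, 64) gives 4096 for L2, and cache_line_count(4 * 1024 * 1024, 64) gives 65536 for a 4 MB L3.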