
CPU: Intel Sandy / Ivy Bridge (x86_64)

I need to write a device driver for a device connected to the CPU via PCI Express, and I need to use the maximum bandwidth. To do this, I map the device memory into the physical address space of the processor, and then map this memory into the virtual address space of the kernel, marked as WC (Write Combined), using ioremap_wc().
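As a rough sketch of that mapping step (not from the question; `DRV_BAR` and the function name are hypothetical, and this fragment only builds against kernel headers, not as a standalone program):

```c
/* Hypothetical sketch: map a PCIe BAR as write-combining in a probe path. */
#include <linux/pci.h>
#include <linux/io.h>

#define DRV_BAR 0  /* which BAR holds the device memory (assumption) */

static void __iomem *drv_map_wc(struct pci_dev *pdev)
{
	resource_size_t start = pci_resource_start(pdev, DRV_BAR);
	resource_size_t len   = pci_resource_len(pdev, DRV_BAR);

	/* Map the device memory into kernel virtual address space with the
	 * WC (write-combining) memory type instead of the default UC. */
	return ioremap_wc(start, len);
}
```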

As is known, an x86_64 CPU has several kinds of buffers:

  1. Cache - the well-known fast memory buffer, consisting of three levels: L1/L2/L3. Each level consists of cache lines of 64 bytes.
    • In WB (Write Back) mode, the CPU writes data from the cache to RAM asynchronously, in the background, in blocks of 64 bytes and in any order.
    • In WT (Write Through) mode, each store to memory (MOV [addr], reg) writes the data to the cache line and to RAM immediately (synchronously).

Details about the cache levels: each core has its own L1 (64 KB, ~1 ns) and L2 (256 KB, ~3 ns), and the whole CPU has one L3 shared by all cores (4 - 40 MB, ~10 ns).

  2. (SB) Store Buffer - a buffer (queue) in which all stores are held in program order, and in that same order the data are lazily written to memory in the background. There is also a way to force the data out of the store buffer to the cache/RAM by using SFENCE or MFENCE (for example, to support sequential consistency between cores).
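For illustration, here is a minimal user-space sketch of forcing stores out with SFENCE via the `_mm_sfence()` intrinsic (the names `publish`, `payload` and `ready` are my own; note that on x86 ordinary WB stores are already ordered, so the fence matters most for WC/non-temporal stores, but the semantics are the same):

```c
#include <immintrin.h>  /* _mm_sfence */
#include <stdint.h>
#include <assert.h>

/* Illustrative producer step: publish a payload, then a ready flag.
 * SFENCE guarantees all earlier stores (the payload) become globally
 * visible before any store that follows it (the flag). */
static volatile uint64_t payload;
static volatile int ready;

void publish(uint64_t value)
{
	payload = value;   /* sits in the store buffer until drained */
	_mm_sfence();      /* drain: payload visible before the flag */
	ready = 1;
}
```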

  3. BIU (Bus Interface Unit) / WCB (Write Combining Buffers) - used in WC (Write Combined) mode. When a memory region is marked as WC, the cache is not used; instead the BIU/WCB is used, 64 bytes in size like a cache line. When we store to memory with MOV [addr], reg 1 byte at a time, 64 times, only after the last byte is stored is the whole BIU/WCB written out to memory - an optimized mechanism for writing data to a memory area in whole 64-byte blocks. For example, this is a very important mechanism for storing data to device memory mapped into the CPU physical address space through the PCI Express interface, where writing (sending) 64 bytes at a time increases the actual bandwidth many times over compared with writing (sending) 1 byte at a time. There is also a way to force the data out of the BIU/WCB to the [remote] memory by using SFENCE or MFENCE.
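A hedged user-space sketch of the byte-by-byte pattern just described (`fill_line` is an illustrative name; here `dst` is ordinary WB memory, so the write-combining burst would only actually occur if `dst` pointed into a WC mapping such as one obtained via ioremap_wc()):

```c
#include <immintrin.h>  /* _mm_sfence */
#include <stdint.h>
#include <assert.h>

/* Fill one 64-byte "line" a byte at a time, then fence.  If dst pointed
 * into a WC-mapped BAR, the 64 byte stores would gather in a
 * write-combining buffer and ideally go out as one 64-byte burst;
 * SFENCE forces any partially filled WC buffer out to the device. */
void fill_line(volatile uint8_t *dst, uint8_t base)
{
	for (int i = 0; i < 64; i++)
		dst[i] = (uint8_t)(base + i);
	_mm_sfence();
}
```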

And some closely related questions:

1. Do the Cache, Store Buffer and BIU/WCB all use the same physical buffer in the CPU (different parts of it), or does each of them have a separate physical buffer in the CPU?

2. If the Cache and BIU use the same physical buffer, for example both use parts of the L1 cache, then why do SFENCE/MFENCE have an impact on the second but not on the first? And if they have separate physical buffers, then why do the cache line and the BIU have the same size of 64 bytes?

3. The number of cache lines is (65536 / 64) = 1024 for L1, (262144 / 64) = 4096 for L2, and 4 MB / 64 bytes = 65536 for L3. The size of the store buffer we don't know. But how many BIUs/WCBs (64 bytes each) are there on a single CPU core, or on the whole CPU?

4. As we can see, the commands SFENCE and MFENCE affect the Store Buffer and the BIU/WCB. But do these commands have any impact on the Cache (L1/L2/L3)?

Comments:

What is your CPU model? – osgx
@osgx CPU: Intel Sandy / Ivy Bridge (x86_64) – Alex
Are you sure your L1 is 64 KB? Maybe you counted both the data and instruction caches. – Leeor
@Leeor Yes, I counted both (L1-data + L1-instructions). – Alex

1 Answer

  1. Caches, Store Buffers and BIU/WCB are all separate physical structures in the CPU.

  2. Why do the cache line and the BIU have the same size of 64 bytes? For convenience and ease of design, and because boundaries between various cacheability regions are at least 64-byte aligned.

  3. The number of BIU/WCBs on a single core is not part of the architecture; it is an implementation detail that might even change from stepping to stepping.

  4. SFENCE and MFENCE cause pending stores to be completed, which might cause some cacheable data to be written from CPU store buffers into the cache.

(edit) The L1/L2/L3 caches form a single cache-coherent system which acts as a shortcut to the external memory.

A fence operation causes pending stores to be written to some particular level of the cache (L1/L2 or L3) depending on the cache inclusion properties implemented in the design. Most typically a fence instruction would cause cacheable data to move from store buffers to L1, but I believe that it is possible for a region of memory to be marked as cacheable in L2 only or L3-only. In that case data would move from store buffer to L2 or L3. (Many MIPS processors support this mode of operation.)

Non-cacheable data would always be written from store buffers/WCBs directly to memory and would never be written to a cache.
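This cache bypass can be demonstrated from user space with non-temporal stores, which route through the write-combining buffers even on ordinary WB memory (a sketch; `copy_line_nt` is an illustrative name, and the intrinsics are the standard SSE2 ones):

```c
#include <immintrin.h>  /* _mm_stream_si32, _mm_sfence */
#include <assert.h>

/* Copy one 64-byte line (16 x 4-byte ints) with non-temporal stores.
 * MOVNTI bypasses the caches and gathers in the write-combining
 * buffers; SFENCE then flushes them and makes the data globally
 * visible, just as for stores to a WC-mapped device region. */
void copy_line_nt(int *dst, const int *src)
{
	for (int i = 0; i < 16; i++)
		_mm_stream_si32(dst + i, src[i]);
	_mm_sfence();
}
```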

I haven't worked on Intel processors since the P6 days, so I don't know implementation details like the number of WCBs or store buffers on current cores.

If you want to know implementation details for a particular Intel core, take a look at Microprocessor Report, or proceedings of the Hot Chips conference. (Both should be available in University libraries.)