how to do mmap for cacheable PCIe BAR

Question

I am trying to write a driver with a custom mmap() function for a PCIe BAR, with the goal of making this BAR cacheable in the processor cache. I am aware that this is not the best way to achieve the highest bandwidth and that the order of writes is unpredictable (neither is an issue in this case).

This is similar to what is described in "How would one prevent MMAP from caching values?"

The processor is a Sandy Bridge i7; the PCIe device is an Altera Stratix IV dev board.

First, I tried it on CentOS 5 (2.6.18). I changed the MTRR settings to make sure the BAR is not within an uncacheable MTRR and used io_remap_pfn_range() with the _PAGE_PCD and _PAGE_PWT bits cleared. Reads worked as expected: they returned correct values, and a second read of the same address did not necessarily go out to PCIe (I checked a read counter in the FPGA). Writes, however, caused the system to freeze and then reboot without any messages in the logs or on the screen.

Second, I tried it on CentOS 6 (2.6.32), which has PAT support. The result is the same: reads work correctly, writes cause a system freeze and reboot. Interestingly, non-temporal/write-combining full-cache-line writes (AVX/SSE) work as expected: they always go to the FPGA, the FPGA observes full-cache-line writes, and reads return correct values afterwards. However, simple 64-bit writes still cause a system freeze/reboot.

I also tried ioremap_cache() followed by iowrite32() inside the driver code. The result is the same.

I think it is a hardware issue, but I would appreciate it if somebody could share any ideas about what is going on.

EDIT: I was able to capture an MCE message on CentOS 6: Machine Check Exception: 5 Bank 5: be2000000003110a.
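For readers who want to pick that status word apart, here is a minimal sketch that splits a 64-bit IA32_MCi_STATUS value into its architectural flag bits and its two 16-bit code fields, following the layout in the machine-check chapter of the Intel SDM. The struct and function names (mci_status, decode_mci_status, print_mci_status) are made up for illustration and are not part of any kernel or tool API. The sketch does not interpret the model-specific bits or the meaning of the MCA error code.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural fields of an IA32_MCi_STATUS value.
 * Names are hypothetical; bit positions follow the Intel SDM layout. */
struct mci_status {
    int valid;                    /* bit 63: VAL, register holds a valid error */
    int overflow;                 /* bit 62: OVER, an earlier error was overwritten */
    int uncorrected;              /* bit 61: UC, error was not corrected */
    int enabled;                  /* bit 60: EN, error reporting was enabled */
    int miscv;                    /* bit 59: MISCV, IA32_MCi_MISC is valid */
    int addrv;                    /* bit 58: ADDRV, IA32_MCi_ADDR is valid */
    int pcc;                      /* bit 57: PCC, processor context corrupt */
    uint16_t model_specific_code; /* bits 31:16 */
    uint16_t mca_code;            /* bits 15:0 */
};

static struct mci_status decode_mci_status(uint64_t s)
{
    struct mci_status st;
    st.valid       = (int)((s >> 63) & 1);
    st.overflow    = (int)((s >> 62) & 1);
    st.uncorrected = (int)((s >> 61) & 1);
    st.enabled     = (int)((s >> 60) & 1);
    st.miscv       = (int)((s >> 59) & 1);
    st.addrv       = (int)((s >> 58) & 1);
    st.pcc         = (int)((s >> 57) & 1);
    st.model_specific_code = (uint16_t)((s >> 16) & 0xffff);
    st.mca_code            = (uint16_t)(s & 0xffff);
    return st;
}

/* Print the decoded fields in a compact, human-readable form. */
static void print_mci_status(uint64_t s)
{
    struct mci_status st = decode_mci_status(s);
    printf("VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n",
           st.valid, st.overflow, st.uncorrected, st.enabled,
           st.miscv, st.addrv, st.pcc);
    printf("model-specific code=0x%04x MCA code=0x%04x\n",
           (unsigned)st.model_specific_code, (unsigned)st.mca_code);
}
```

Calling print_mci_status(0xbe2000000003110aULL) from a small main() shows VAL, UC, EN, MISCV, ADDRV and PCC set and OVER clear, with MCA code 0x110a and model-specific code 0x0003. An uncorrected error with PCC set is consistent with the machine resetting rather than recovering.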

I also tried the same code on a 2-socket Sandy Bridge (Romley) system: read and non-temporal write behavior is the same; simple writes do not cause an MCE/crash but have no effect on system state, i.e. the value in memory does not change.

I also tried the same code on an older 2-socket Nehalem system: simple writes also cause an MCE there, although the codes are different.

John D McCalpin · Accepted Answer · 2013-06-06T17:52:43

I am not aware of any x86 hardware that supports the WriteBack (WB) memory type for MMIO addresses, and you are almost certainly seeing a result of that incompatibility. I have posted a discussion of this topic on my blog at http://blogs.utexas.edu/jdm4372/2013/05/29/ and http://blogs.utexas.edu/jdm4372/2013/05/30/

In those postings, I discuss a method that works on some processors -- map the MMIO range twice -- once for store operations from the processor to the FPGA using the Write-Combining (WC) memory type, and once for reads from the processor to the FPGA using the Write Protect (WP) or Write Through (WT) types. You will need to maintain coherence manually by using CLFLUSH on cache lines in the "read only" region when you write to the alias of that line in the "write only" region. You will also need to maintain coherence manually with respect to changes in the values in the FPGA memory, since IO devices cannot generate cache invalidation transactions for MMIO addresses.
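The user-space side of that manual coherence protocol could look roughly like the sketch below. It assumes the driver has already produced two mappings of the same BAR: a write-only alias with the WC type and a read-only alias with the WT or WP type. The sketch does not show how to create those mappings. The function names bar_write_line and bar_read_line_fresh, and the parameter names wc_line and rd_line, are hypothetical, and both pointers are assumed to be 64-byte aligned and to refer to the same cache line of the device. It uses only SSE2 intrinsics and is an illustration of the idea, not the code from the blog posts.

```c
#include <emmintrin.h> /* SSE2: _mm_stream_si128, _mm_clflush, _mm_mfence */
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

/* Store one full 64-byte line through the write-only (WC) alias, then
 * discard any cached copy of the same line in the read-only alias so
 * that the next load through rd_line has to go back to the device. */
static void bar_write_line(void *wc_line, const void *rd_line, const void *src)
{
    __m128i *dst = (__m128i *)wc_line;        /* must be 16-byte aligned */
    const __m128i *s = (const __m128i *)src;  /* may be unaligned */

    for (int i = 0; i < CACHE_LINE / 16; i++)
        _mm_stream_si128(&dst[i], _mm_loadu_si128(&s[i]));

    _mm_sfence();         /* push the combined write out of the WC buffer */
    _mm_clflush(rd_line); /* invalidate the stale line in the read alias */
    _mm_mfence();         /* keep later loads from passing the flush */
}

/* Read a line that the device itself may have changed. The device cannot
 * invalidate the processor cache for MMIO addresses, so the cached copy
 * is flushed before the line is loaded again. */
static void bar_read_line_fresh(const void *rd_line, void *dst)
{
    _mm_clflush(rd_line);
    _mm_mfence();
    memcpy(dst, rd_line, CACHE_LINE);
}
```

Reads that do not need fresh device data can go straight through the read alias and hit in the cache. Only the two cases named above (a processor write through the other alias, or a change made by the FPGA) need the explicit flush.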

My team did this a few years ago when I was at AMD, and I am now trying to figure out how to do it with newer Linux kernels and with Intel processors. Linux does not directly support the WP or WT memory types with its pre-defined mapping functions, so some hacking is required. It is fairly easy to override the MTRR for a region, but I am having more trouble finding the correct place(s) in the descendants of the remap_pfn_range() function that I need to change in order to get the WP or WT attribute set in the PAT entries for the range.

This method is probably better suited for FPGAs than for other (pre-defined) types of IO devices, since the programmability of the FPGA allows the flexibility to define the PCI BARs to operate in this double-mapped mode and to cooperate with the processor-side driver in maintaining cache coherence.
