8 votes

I found that my MMIO read/write latency is unreasonably high. I hope someone can give me some suggestions.

In kernel space, I wrote a simple program to read a 4-byte value from a PCIe device's BAR0 address. The device is an Intel 10G PCIe NIC plugged into a PCIe x16 slot on my Xeon E5 server. I use rdtsc to measure the time between the beginning and the end of the MMIO read; the code snippet looks like this:

void __iomem *vaddr;
u32 init, end, ret;

vaddr = ioremap_nocache(0xf8000000, 128); // 0xf8000000 is the BAR0 of the device
rdtscl(init);
ret = readl(vaddr);
rmb();
rdtscl(end);

I was expecting the elapsed time between init and end to be less than 1 µs; after all, the data traversing the PCIe link should take only a few nanoseconds. However, my test results show that an MMIO read from the PCIe device takes at least 5.5 µs. I'm wondering whether this is reasonable. I changed my code to remove the memory barrier (rmb()), but I still get around 5 µs of latency.

This paper mentions PCIe latency measurement; usually it's less than 1 µs: www.cl.cam.ac.uk/~awm22/.../miller2009motivating.pdf Do I need to do any special kernel or device configuration to get lower MMIO access latency? Or does anyone have experience doing this?


2 Answers

2 votes

5 µs is great! Run that in a loop and gather statistics, and you might find much larger values.
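
A minimal sketch of that idea, assuming the same vaddr mapping as in the question and a kernel that provides rdtsc_ordered() (the iteration count and variable names are illustrative):

/* Sketch: repeat the timed MMIO read and track min/max cycle counts.
 * Assumes vaddr was mapped with ioremap_nocache() as in the question. */
u64 best = ~0ULL, worst = 0;
int i;

for (i = 0; i < 1000; i++) {
    u64 start, stop, delta;

    start = rdtsc_ordered(); /* serialized TSC read */
    (void)readl(vaddr);      /* the MMIO read under test */
    stop = rdtsc_ordered();

    delta = stop - start;
    if (delta < best)
        best = delta;
    if (delta > worst)
        worst = delta;
}
pr_info("MMIO read latency: min=%llu max=%llu cycles\n", best, worst);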

There are several reasons for this. BARs are usually mapped non-cacheable and non-prefetchable; check yours using pci_resource_flags(). If the BAR were marked cacheable, then cache coherency, the process of ensuring that all CPUs see the same cached value, might be one issue.
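
A minimal sketch of that check, assuming pdev is the struct pci_dev for your NIC (for example, obtained in your driver's probe function):

/* Print whether BAR0 is memory-mapped and prefetchable.
 * Assumes pdev points to the NIC's struct pci_dev. */
unsigned long flags = pci_resource_flags(pdev, 0); /* BAR0 */

if (flags & IORESOURCE_MEM)
    pr_info("BAR0 is memory-mapped\n");
if (flags & IORESOURCE_PREFETCH)
    pr_info("BAR0 is prefetchable\n");
else
    pr_info("BAR0 is non-prefetchable\n");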

Secondly, an I/O read is always a non-posted transaction. The CPU has to stall until it gets permission to communicate on some data bus, and stall a bit more until the data arrives on said bus. This bus is made to appear like memory but in actual fact is not, and the stall may be a non-interruptible busy wait, but it's non-productive nevertheless. So I would expect the worst-case latency to be much higher than 5 µs, even before you start to consider task preemption.

-1 votes

If the NIC needs to go over the network, perhaps through switches, to get the data from a remote host, 5.5 µs is a reasonable read time. If you are reading a register in the local PCIe device, it should take less than 1 µs. I don't have any experience with the Intel 10G NIC, but I have worked with InfiniBand and custom cards.