Linux device driver DMA memory buffer not seen in order by PCIe hardware

Question

I'm developing a device driver for a custom Xilinx Virtex 6 PCIe board. When doing a DMA write (from host to device), here is what happens:

user space app:

a. fill buffer with the following byte pattern (tested up to 16kB)
    00 00 .. 00 (64 bytes)
    01 01 .. 01 (64 bytes)
    ...
    ff ff .. ff (64 bytes)
    00 00 .. 00 (64 bytes)
    01 01 .. 01 (64 bytes)
    etc

b. call custom ioctl to pass pointer to buffer and size
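
For reference, the fill in step a can be sketched as follows. This is a minimal user-space sketch, not the original application code: fill_pattern and first_mismatch are hypothetical helper names, and the checker is only what one could run on a buffer read back from the device to locate the first out-of-place byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64u  /* bytes per pattern block, as in the question */

/* Fill buf so that every byte of 64-byte block k holds the value k mod 256. */
void fill_pattern(uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)((i / BLOCK_SIZE) & 0xffu);
}

/* Return the byte offset of the first mismatch, or len if buf matches. */
size_t first_mismatch(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (buf[i] != (uint8_t)((i / BLOCK_SIZE) & 0xffu))
            return i;
    return len;
}
```

A mismatch offset that is a multiple of 64 points at whole blocks being swapped rather than at corrupted bytes.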

kernel space:

a. retrieve buffer (bufp) with 
    copy_from_user(ptdev->kbuf, bufp, cnt)
b. setup and start DMA 
    b1. //setup physical address
        iowrite32(cpu_to_be32((u32) ptdev->kbuf_dma_addr),
            ptdev->region0 + TDO_DMA_HOST_ADDR);
    b2. //setup transfer size
        iowrite32(cpu_to_be32( ((cnt+3)/4)*4 ), 
            ptdev->region0 + TDO_DMA_BYTELEN);
    b3. //memory barrier to make sure kbuf is in memory
        mb(); 
    b4. //start dma
        iowrite32(cpu_to_be32(TDO_DMA_H2A | TDO_DMA_BURST_FIXED | TDO_DMA_START),
            ptdev->region0 + TDO_DMA_CTL_STAT);
c. put process to sleep
    wait_res = wait_event_interruptible_timeout(ptdev->dma_queue, 
                            !(tdo_dma_busy(ptdev, &dma_stat)), 
                            timeout);
d. check wait_res result and dma status register and return
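
Two small pieces of arithmetic are hidden in step b: the byte count is rounded up to a whole number of DWORDs, and every register value is byte-swapped before it is written. A user-space sketch of both, with hypothetical helper names (the driver itself uses cpu_to_be32(), which is not available outside the kernel):

```c
#include <assert.h>
#include <stdint.h>

/* Round a byte count up to the next multiple of 4 (one PCIe DWORD),
 * equivalent to ((cnt + 3) / 4) * 4 in step b2. */
uint32_t round_up_dword(uint32_t cnt)
{
    assert(cnt <= UINT32_MAX - 3u);  /* cnt + 3 must not wrap around */
    return (cnt + 3u) & ~UINT32_C(3);
}

/* Store v most-significant byte first, i.e. the in-memory byte order
 * of a big-endian 32-bit value, independent of the host endianness. */
void put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}
```

With these, any cnt from FF5h to FF8h yields a transfer size of FF8h, and the address 37900000h is laid out as the bytes 37 90 00 00.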

Note that the kernel buffer is allocated once at device probe with:
ptdev->kbuf = pci_alloc_consistent(dev, ptdev->kbuf_size, /* 512 kB */
                                &ptdev->kbuf_dma_addr);

device pcie TLP dump (obtained through logic analyzer after Xilinx core):

a. TLP received (by the device)
 a1. 40000001 0000000F F7C04808 37900000 (MWr corresponds to b1 above)
 a2. 40000001 0000000F F7C0480C 00000FF8 (MWr corresponds to b2 above)
 a3. 40000001 0000000F F7C04800 00010011 (MWr corresponds to b4 above)

b. TLP sent (by the device)
 b1. 00000080 010000FF 37900000 (MRd 80h DW @ addr 37900000h)
 b2. 00000080 010000FF 37900200 (MRd 80h DW @ addr 37900200h)
 b3. 00000080 010000FF 37900400 (MRd 80h DW @ addr 37900400h)
 b4. 00000080 010000FF 37900600 (MRd 80h DW @ addr 37900600h)
...

c. TLP received (by the device)
 c1. 4A000020 00000080 01000000 00 00 .. 00 01 01 .. 01 CplD 128B
 c2. 4A000020 00000080 01000000 02 02 .. 02 03 03 .. 03 CplD 128B
 c3. 4A000020 00000080 01000000 04 04 .. 04 05 05 .. 05 CplD 128B 
 c4. 4A000020 00000080 01000000 06 06 .. 06 0A 0A .. 0A CplD 128B  <= 
 c5. 4A000010 00000040 01000040 07 07 .. 07             CplD  64B  <= 
 c6. 4A000010 00000040 01000040 0B 0B .. 0B             CplD  64B  <= 
 c7. 4A000020 00000080 01000000 08 08 .. 08 09 09 .. 09 CplD 128B  <= 
 c8. 4A000020 00000080 01000000 0C 0C .. 0C 0D 0D .. 0D CplD 128B 
.. the remaining bytes are transferred correctly and 
the total number of bytes (FF8h) matches the requested size;
the device then signals the interrupt
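
The header words in the dump can be decoded by hand; the sketch below does the same in C. It assumes the standard PCIe header layout (Fmt in bits 31:29, Type in bits 28:24 and Length in bits 9:0 of the first DWORD; Tag in bits 15:8 of the second DWORD of a memory request and of the third DWORD of a completion) and only recognizes the TLP types that appear above. All names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum tlp_kind { TLP_MRD, TLP_MWR, TLP_CPL, TLP_CPLD, TLP_OTHER };

struct tlp_dw0 {
    unsigned fmt;        /* bits 31:29 of the first header DWORD */
    unsigned type;       /* bits 28:24 */
    unsigned length_dw;  /* bits 9:0; the encoding 0 means 1024 DW */
};

struct tlp_dw0 decode_dw0(uint32_t dw0)
{
    struct tlp_dw0 h;
    h.fmt = (dw0 >> 29) & 0x7u;
    h.type = (dw0 >> 24) & 0x1fu;
    h.length_dw = dw0 & 0x3ffu;
    if (h.length_dw == 0)
        h.length_dw = 1024;
    return h;
}

/* Classify memory requests and completions; everything else is OTHER. */
enum tlp_kind tlp_classify(struct tlp_dw0 h)
{
    if (h.fmt > 3u)                /* TLP prefixes are not handled here */
        return TLP_OTHER;
    if (h.type == 0x00u)           /* fmt bit 1 set: TLP carries data */
        return (h.fmt & 0x2u) ? TLP_MWR : TLP_MRD;
    if (h.type == 0x0au)
        return (h.fmt & 0x2u) ? TLP_CPLD : TLP_CPL;
    return TLP_OTHER;
}

/* Length field converted to bytes (requested or carried, depending on kind). */
unsigned tlp_length_bytes(struct tlp_dw0 h)
{
    assert(h.length_dw >= 1u && h.length_dw <= 1024u);
    return h.length_dw * 4u;
}

/* Tag field: pass DW1 of a memory request or DW2 of a completion. */
unsigned tlp_tag(uint32_t dw)
{
    return (dw >> 8) & 0xffu;
}
```

Applied to the dump, 40000001 decodes as MWr with 1 DW, 00000080 as MRd with 80h DW (512 bytes), and 4A000020 as CplD with 20h DW (128 bytes). The tag field of both 010000FF (read requests) and 01000000 (completions) is 00h.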

This apparent memory ordering error occurs with high probability (0.8 < p < 1), and the mismatch appears at a different, random point in each transfer.

EDIT: Note that point c4 above would indicate that the memory is not filled in the right order by the kernel driver (I assume the memory controller fills each TLP from contiguous memory). Since 64B is the cache line size, maybe this has something to do with cache operations.

When I disable cache on the kernel buffer with,

echo "base=0xaf180000 size=0x00008000 type=uncachable" > /proc/mtrr

the error still occurs, but much less often (p < 0.1, depending on the transfer size).

This only happens on an i7-4770 (Haswell) based machine (tested on three identical machines, with three boards). I tried kernel 2.6.32 (RH6.5), stock 3.10.28, and stock 3.13.1 with the same results.

I tried the code and device in an i7-610 QM57 based machine and Xeon 5400 machine without any issues.

Any ideas/suggestions are welcome.

Best regards

Claudio

Thomas Thomas · Accepted Answer · 2016-03-16T12:32:20

I know this is an old thread, but the reason for the "errors" is completion reordering. Multiple outstanding read requests don't have to be answered in order. Completions are only guaranteed to arrive in order within the same request. On top of that, every request in the dump carries the same tag, which is illegal while those requests are outstanding at the same time.
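
To illustrate the bookkeeping this implies on the requester side, here is a small software model in C. It is only a sketch under assumptions: on the real board this logic would live in the FPGA design, and MAX_TAGS, struct reassembler and the reasm_* functions are hypothetical names. Each outstanding read gets its own tag, and every completion is written to the offset recorded for its tag, so the arrival order across different requests no longer matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TAGS 32u  /* assumed limit on outstanding read requests */

struct read_slot {
    int    busy;       /* nonzero while the request is outstanding */
    size_t next_off;   /* where the next completion for this tag lands */
    size_t remaining;  /* bytes still expected for this tag */
};

struct reassembler {
    uint8_t *dst;
    size_t dst_len;
    struct read_slot slot[MAX_TAGS];
};

void reasm_init(struct reassembler *r, uint8_t *dst, size_t dst_len)
{
    memset(r, 0, sizeof *r);
    r->dst = dst;
    r->dst_len = dst_len;
}

/* Record a new read request; returns its unique tag, or -1 if none is free. */
int reasm_issue(struct reassembler *r, size_t off, size_t len)
{
    assert(off <= r->dst_len && len <= r->dst_len - off);
    for (unsigned t = 0; t < MAX_TAGS; t++) {
        if (!r->slot[t].busy) {
            r->slot[t] = (struct read_slot){ 1, off, len };
            return (int)t;
        }
    }
    return -1;
}

/* Place one completion by tag; returns 0 on success, -1 if unexpected. */
int reasm_complete(struct reassembler *r, unsigned tag,
                   const uint8_t *data, size_t len)
{
    if (tag >= MAX_TAGS || !r->slot[tag].busy ||
        len > r->slot[tag].remaining)
        return -1;
    struct read_slot *s = &r->slot[tag];
    memcpy(r->dst + s->next_off, data, len);  /* in order within one tag */
    s->next_off += len;
    s->remaining -= len;
    if (s->remaining == 0)
        s->busy = 0;                          /* tag may be reused now */
    return 0;
}
```

With a single shared tag, as in the dump, the requester cannot tell which request a completion belongs to, so data from a later request can end up in the middle of an earlier one.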
