Linux Driver and API architecture for a data acquisition device

Question

We're trying to write a driver/API for a custom data acquisition device, which captures several "channels" of data. For the sake of discussion, let's assume this is a multi-channel video capture device. The device is connected to the system via an x8 PCIe Gen-1 link, which has a theoretical throughput of 16 Gbps. Our actual data rate will be around 2.8 Gbps (~350 MB/sec).
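For reference, the link budget quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it counts 8b/10b line coding only, ignoring packet and protocol overhead, so real usable bandwidth will be somewhat lower.

```c
#include <assert.h>

/* Theoretical PCIe Gen-1 bandwidth in Mbit/s. Each lane signals at
 * 2.5 GT/s, and 8b/10b coding carries 8 payload bits per 10 line bits.
 * Packet/protocol overhead is NOT accounted for (assumption). */
unsigned long pcie_gen1_mbps(unsigned lanes)
{
    assert(lanes > 0);
    return lanes * 2500UL * 8 / 10;
}

/* Convert Mbit/s to MByte/s. */
unsigned long mbps_to_mbytes(unsigned long mbps)
{
    return mbps / 8;
}
```

pcie_gen1_mbps(8) gives 16000 Mbit/s (2000 MB/sec), and mbps_to_mbytes(2800) gives 350 MB/sec, so the required rate is well under a fifth of what the link can theoretically carry.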

Because of the data rate requirement, we think we have to be careful about the driver/API architecture. We've already implemented a descriptor-based DMA mechanism and the associated driver. For example, we can start a DMA transaction for 256KB from the device and it completes successfully. However, in this implementation we only capture the data in the kernel driver and then drop it; we aren't streaming the data to user space at all. Essentially, this is just a small DMA test implementation.

We think we have to separate the problem into three sections: 1. Kernel driver 2. Userspace API 3. User Code

The acquisition device has a register in the PCIe address space which indicates whether there is data to read for any channel from the device. So, our kernel driver must poll this bit-vector. When the kernel driver sees a bit set, it starts a DMA transaction. The user application, however, does not need to know about all these DMA transactions and data until an entire chunk of data is ready. For example, assume that the device provides us with 16 lines of video data per transaction, but we need to notify the user only when the entire video frame is ready. We only need to transfer entire frames to the user application.
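To make the bookkeeping concrete, here is a minimal user-space model of that logic: scan one snapshot of the "data ready" bit-vector and report which channels have just completed a whole frame. It is a sketch under assumptions, not driver code. The names (struct frame_acc, frame_acc_chunk_done, handle_ready_bits), the 32-bit register width and the fixed chunk size are all hypothetical, and the DMA itself is only marked by a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHANNELS 32          /* assumed width of the ready register */

struct frame_acc {
    size_t frame_bytes;          /* bytes in one complete frame */
    size_t filled;               /* bytes transferred so far for this frame */
};

/* Record one finished DMA chunk. Returns 1 when this chunk completes the
 * frame (and restarts the count for the next frame), otherwise 0. */
int frame_acc_chunk_done(struct frame_acc *a, size_t chunk_bytes)
{
    assert(a->filled + chunk_bytes <= a->frame_bytes);
    a->filled += chunk_bytes;
    if (a->filled < a->frame_bytes)
        return 0;
    a->filled = 0;
    return 1;
}

/* Handle one snapshot of the data-ready bit-vector. Returns a bitmask of
 * the channels whose frame became complete during this pass. */
uint32_t handle_ready_bits(uint32_t ready, struct frame_acc *acc,
                           unsigned nchan, size_t chunk_bytes)
{
    uint32_t frames_done = 0;

    assert(nchan <= MAX_CHANNELS);
    for (unsigned ch = 0; ch < nchan; ch++) {
        if (!(ready & (UINT32_C(1) << ch)))
            continue;
        /* A real driver would start the DMA for channel 'ch' here. */
        if (frame_acc_chunk_done(&acc[ch], chunk_bytes))
            frames_done |= UINT32_C(1) << ch;
    }
    return frames_done;
}
```

With a 64-line frame delivered 16 lines at a time, handle_ready_bits returns 0 for a channel's first three chunks and sets that channel's bit on the fourth, which is the only point at which user space needs to be told anything.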

Here was our first attempt:

Our user-side API allows a user application to register a function callback for a "channel".
The user-side API has a "start" function, which can be called by the user application, which uses ioctl to send a start message to the kernel driver.
In the kernel driver, upon receiving the start message, we started a kernel thread, which continuously monitors the "data ready" bit-vector, and when it sees new data, copies it over to a driver-allocated (kmalloc) buffer. It keeps doing this until the size of the collected data reaches the "frame size".
At this point a custom Linux signal (similar to SIGINT, SIGHUP, etc.) is sent to the process which is running the driver. Our API catches this signal and then calls back the appropriate user callback function.
The user callback function calls a function in the API (transfer_data), which uses an ioctl call to send a userspace buffer address to the kernel, and the kernel completes the data transfer by doing a copy_to_user of the channel frame data to userspace.

All of the above is working OK, except that the performance is abysmal. We can only achieve about 2MB/sec of transfer rate. We need to completely re-write this and we're open to any suggestions or pointers to examples.

Other notes:

Unfortunately, we cannot change anything in the hardware device. So we must poll for the "data-ready" bit and start DMA based on that bit.
Some people suggested looking at the Infiniband drivers as a reference, but we're completely lost in that code.

I would recommend looking for similar drivers in the kernel. Maybe you will get some insights into driver architecture. Try to look for drivers that use the same APIs as yours (e.g. DMA, PCI, video, etc.). Then investigate the code you find and try to understand how you can reuse the ideas behind it. When I was writing a radio driver for an MFD device (from scratch), I did exactly the same: found some existing drivers and figured out how I could split the code in an elegant manner and how I could implement the interaction between the different parts of the driver. — Sam Protsenko
Sorry, a bit of advertisement. ohwr.org/projects/zio This is what I wrote for this specific purpose, it is a generic Linux I/O framework. It is not specialized for video, it just moves data from hardware to user space. Our most demanding board is 1.4Gbps and we are going to have another one at 2.5 Gbps in the next weeks. You can write me an email if you want to evaluate it or need more information (you can find the email on the project site). — Federico
There is a new driver to support Intel Processor Trace technology, which can produce a huge amount of data (~200MB/s). I would recommend looking at that code. lwn.net/Articles/629480 — 0andriy

EML · Accepted Answer · 2016-03-15T20:13:43

You're probably way past this now, but if not here's my 2p.

It's hard to believe that your card can't generate interrupts when it has transferred data. It's got a DMA engine, and it can handle 'descriptors', which are presumably elements of a scatter-gather list. I'll assume that it can generate a PCIe 'interrupt'; YMMV.
Don't bother trawling the kernel for existing similar drivers. You might get lucky, but I suspect not.

You need to write a blocking read, to which you supply a large memory buffer. The driver read op (a) gets a list of the user pages backing your buffer and locks them in memory (get_user_pages); (b) creates a scatter list and maps it with pci_map_sg (dma_map_sg in current kernels, where the older pci_* DMA wrappers are deprecated); (c) iterates through the mapped list (for_each_sg); (d) for each entry writes the corresponding bus address and data length to the DMA controller as what I presume you're calling a 'descriptor'.
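Steps (c) and (d) can be modelled in user space as follows. This is a sketch, not driver code: struct dma_desc, build_desc_list and the 4096-byte page size are assumptions for illustration, and plain integer addresses stand in for the bus addresses a real driver would read from each mapped scatterlist entry (sg_dma_address and sg_dma_len). The point it shows is that one large user buffer turns into many descriptors, none of which crosses a page boundary, because the underlying pages are generally not physically contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u            /* assumed page size */

struct dma_desc {                /* hypothetical descriptor layout */
    uint64_t addr;               /* stand-in for a bus address */
    uint32_t len;                /* bytes to transfer to that address */
};

/* Split [start, start + len) into segments that never cross a page
 * boundary. Returns the number of descriptors written, or -1 if more
 * than 'max' descriptors would be needed. */
int build_desc_list(uint64_t start, size_t len, struct dma_desc *out, int max)
{
    int n = 0;

    assert(out != NULL && max >= 0);
    while (len > 0) {
        size_t room = PAGE_SZ - (size_t)(start % PAGE_SZ);
        size_t seg = len < room ? len : room;

        if (n == max)
            return -1;
        out[n].addr = start;
        out[n].len = (uint32_t)seg;
        n++;
        start += seg;
        len -= seg;
    }
    return n;
}
```

For a 10000-byte buffer that starts 100 bytes into a page, this yields three descriptors of 3996, 4096 and 1908 bytes. In the real driver, adjacent entries may be merged by the DMA mapping layer, so the count can be lower than the number of pages.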

The card now has a list of descriptors which correspond to the physical bus addresses of your large user buffer. When data arrives at the card, it writes it directly into user space, into your user buffer, while your user-level read is still blocked. When it has finished the descriptor list, the card has to be able to interrupt, or it's useless. The driver responds to the interrupt and unblocks your user-level read.
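The block-until-interrupt handshake looks roughly like this when modelled in user space with pthreads. Again a sketch under assumptions: struct xfer and the function names are hypothetical, xfer_complete stands in for the interrupt handler and xfer_wait for the sleeping part of the read op. In a driver the same pattern is usually written with a wait queue (wait_event_interruptible / wake_up_interruptible) or a completion.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct xfer {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int done;                    /* set by the "interrupt", cleared by the reader */
};

void xfer_init(struct xfer *x)
{
    int rc = pthread_mutex_init(&x->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&x->cv, NULL);
    assert(rc == 0);
    (void)rc;
    x->done = 0;
}

/* Stand-in for the interrupt handler: the descriptor list has finished. */
void xfer_complete(struct xfer *x)
{
    pthread_mutex_lock(&x->lock);
    x->done = 1;
    pthread_cond_broadcast(&x->cv);
    pthread_mutex_unlock(&x->lock);
}

/* Stand-in for the blocked read: sleep until the transfer is done.
 * Testing the flag under the lock means a completion that arrives
 * before the reader goes to sleep is not lost. */
void xfer_wait(struct xfer *x)
{
    pthread_mutex_lock(&x->lock);
    while (!x->done)
        pthread_cond_wait(&x->cv, &x->lock);
    x->done = 0;
    pthread_mutex_unlock(&x->lock);
}

/* Thread entry that plays the role of the card raising its interrupt. */
void *fake_irq(void *arg)
{
    xfer_complete((struct xfer *)arg);
    return NULL;
}
```

A caller would start fake_irq with pthread_create and then call xfer_wait, which returns once the "interrupt" has fired. The reader burns no CPU while the data is moving, which is the main difference from the polling kernel thread in the question.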

And that's it. The details are nasty, of course, and poorly documented, but that should be the basic architecture. If you really haven't got interrupts you can set up a timer in the kernel to poll for completion of transfer, but if it is really a custom card you should get your money back.
