I am writing a Linux kernel driver for a custom USB device that will use bulk endpoints. Everything seems to work, but I am getting very slow data rates: it takes ~25 seconds to write and read 10MB of data. I tried this on an embedded system and on a Linux VM running on a reasonable PC, with similar results.
I am using an EZ-USB FX2 development kit from Cypress as the target board. It is running the bulkloop firmware, which sets up two in and two out endpoints. Each endpoint is double buffered and supports 512-byte windows. The firmware polls the out endpoints in a while(1) loop in main(), with no sleep, and uses autopointers to copy data from the out endpoints to the in endpoints whenever data is available. I have been told that this can move data fairly quickly on Windows using their specific application, but I have not had a chance to verify this.
My code (relevant portions below) calls a function named bulk_io from the device probe routine. This function creates a number (URB_SETS) of out urbs, each of which attempts to write 512 bytes to the device. Changing this number between 1 and 32 doesn't change performance. They all copy from the same buffer. The callback for each write to an out endpoint creates a read urb on the corresponding in endpoint. The read callback in turn creates another write urb, until I have hit the total number of write/read requests that I want to run (20,000).

I am now working on pushing most of the operations in the callback functions into bottom halves, in case they are blocking other interrupts. I am also thinking of rewriting the bulkloop firmware for the Cypress FX2 to use interrupts instead of polling. Is there anything here that looks out of the ordinary and would make the performance so low? Thank you in advance. Please let me know if you would like to see more code; this is just a bare-bones driver to test I/O to the Cypress FX2.
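To put the numbers above in perspective, here is a minimal user-space sketch of the arithmetic. The macro and function names are hypothetical and not part of the driver. The values are the ones quoted in this question (20,000 round trips, 512 bytes each, roughly 25 seconds), and the 125 µs figure is the length of one high-speed USB microframe.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the question; all names here are hypothetical. */
#define ROUND_TRIPS  20000u  /* write/read pairs (SEQ in the driver) */
#define PACKET_BYTES 512u    /* payload per urb */
#define ELAPSED_SEC  25u     /* measured wall time, approximate */
#define UFRAME_USEC  125u    /* one high-speed USB microframe */

/* Total payload moved in one direction, in bytes. */
static uint64_t total_bytes(uint32_t trips, uint32_t packet_bytes)
{
    return (uint64_t)trips * packet_bytes;
}

/* Average one-way throughput in bytes per second. */
static uint64_t bytes_per_sec(uint64_t bytes, uint32_t seconds)
{
    assert(seconds > 0);
    return bytes / seconds;
}

/* Average time spent on one write/read pair, in microseconds. */
static uint64_t usec_per_trip(uint32_t seconds, uint32_t trips)
{
    assert(trips > 0);
    return (uint64_t)seconds * 1000000u / trips;
}
```

With the values above this gives 10,240,000 bytes in total, about 409,600 bytes/s in each direction, and 1,250 µs per write/read pair, which is ten microframes per round trip.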
This is the out endpoint write callback function:
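    static void bulk_io_in_callback0(struct urb *t_urb); // defined below

    static void bulk_io_out_callback0(struct urb *t_urb) {
        // will need to make this work with bottom half
        struct usb_dev_stat *uds = t_urb->context;
        // completion handlers run in atomic context, so GFP_ATOMIC rather than GFP_KERNEL
        struct urb *urb0 = usb_alloc_urb(0, GFP_ATOMIC);
        if (urb0 == NULL) {
            printk(KERN_ERR "bulk_io_out_callback0: out of memory!\n");
            return; // do not fill or submit a NULL urb
        }
        // the write finished, so queue the matching read on the in endpoint
        usb_fill_bulk_urb(urb0, interface_to_usbdev(uds->intf), usb_rcvbulkpipe(uds->udev, uds->ep_in[0]), uds->buf_in, uds->max_packet, bulk_io_in_callback0, uds);
        usb_submit_urb(urb0, GFP_ATOMIC);
        usb_free_urb(urb0); // drops our reference; the USB core keeps its own until completion
    }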
This is the in endpoint read callback function:
    static void bulk_io_in_callback0(struct urb *t_urb) {
        struct usb_dev_stat *uds = t_urb->context;
        struct urb *urb0;
        if (uds->seq--) {
            // more round trips to go: queue the next write on the out endpoint
            // completion handlers run in atomic context, so GFP_ATOMIC rather than GFP_KERNEL
            urb0 = usb_alloc_urb(0, GFP_ATOMIC);
            if (urb0 == NULL) {
                printk(KERN_ERR "bulk_io_in_callback0: out of memory!\n");
                return; // do not fill or submit a NULL urb
            }
            usb_fill_bulk_urb(urb0, interface_to_usbdev(uds->intf), usb_sndbulkpipe(uds->udev, uds->ep_out[0]), uds->buf_out, uds->max_packet, bulk_io_out_callback0, uds);
            usb_submit_urb(urb0, GFP_ATOMIC);
            usb_free_urb(urb0); // drops our reference; the USB core keeps its own until completion
        }
        else {
            uds->t1 = get_seconds();
            uds->buf_in[8] = 0; // terminate after the first 8 chars so only they are printed below
            printk("bulk_io_in_callback0: completed, time=%lds, bytes=%d, data=%s\n", (uds->t1-uds->t0), uds->max_packet*SEQ, uds->buf_in);
        }
    }
This function gets called to set up the initial urbs:
    static int bulk_io(struct usb_interface *interface, struct usb_dev_stat *uds) {
        struct urb *urb0;
        int i;
        int status;
        uds->t0 = get_seconds(); // note: one-second resolution
        memcpy(uds->buf_out, "abcd1234", 8);
        uds->seq = SEQ; // how many times we will run this
        printk("bulk_io: starting up the stream, seq=%ld\n", uds->seq);
        // start URB_SETS write urbs; each completion chains the next read/write
        for (i = 0; i < URB_SETS; i++) {
            urb0 = usb_alloc_urb(0, GFP_KERNEL); // probe runs in process context, so GFP_KERNEL is fine here
            if (urb0 == NULL) {
                printk(KERN_ERR "bulk_io: out of memory!\n");
                return -ENOMEM;
            }
            usb_fill_bulk_urb(urb0, interface_to_usbdev(uds->intf), usb_sndbulkpipe(uds->udev, uds->ep_out[0]), uds->buf_out, uds->max_packet, bulk_io_out_callback0, uds);
            status = usb_submit_urb(urb0, GFP_KERNEL);
            printk("bulk_io: submitted urb, status=%d\n", status);
            usb_free_urb(urb0); // we don't need this anymore
        }
        return 0;
    }
Edit 1: I verified that udev->speed == 3, i.e. USB_SPEED_HIGH, so this is not because Linux thinks this is a slow device.
Edit 2: I moved everything in the callbacks related to urb creation (kmalloc, submit) and freeing into bottom halves; the performance is the same.
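For readers who want to decode the speed value, here is a small user-space sketch. The enum is a hand-copied, hypothetical mirror of the first values of enum usb_device_speed from <linux/usb/ch9.h> so that it builds outside the kernel; speed_name is an illustrative helper, not a kernel function.

```c
/* Hypothetical user-space mirror of the first values of
 * enum usb_device_speed from <linux/usb/ch9.h>. */
enum speed_mirror {
    SPEED_UNKNOWN = 0,
    SPEED_LOW     = 1,  /* low speed, 1.5 Mbit/s */
    SPEED_FULL    = 2,  /* full speed, 12 Mbit/s */
    SPEED_HIGH    = 3   /* high speed, 480 Mbit/s */
};

/* Map a numeric speed value to a readable name (illustrative helper). */
static const char *speed_name(int speed)
{
    switch (speed) {
    case SPEED_LOW:  return "low";
    case SPEED_FULL: return "full";
    case SPEED_HIGH: return "high";
    default:         return "unknown";
    }
}
```

Here speed_name(3) returns "high", which matches the value read from udev->speed.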