
I am writing a Linux kernel driver for a custom USB device that uses bulk endpoints. Everything seems to work, but the data rates are very slow: it takes ~25 seconds to write and read back 10MB of data. I tried this on an embedded system and on a Linux VM running on a reasonable PC, with similar results.
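For scale, here is a rough userspace sketch of the arithmetic behind that figure. It assumes the total is 512 bytes x 20,000 transfers = 10,240,000 bytes (the numbers used in the code further down) and compares the measured rate with the commonly cited high-speed bulk ceiling of 13 packets of 512 bytes per 125 us microframe. The function and macro names are made up for illustration.

```c
/* Hypothetical helpers for illustration only; not part of the driver. */

#define HS_MICROFRAMES_PER_SEC     8000L /* one microframe every 125 us */
#define HS_BULK_PKTS_PER_MICROFRAME  13L /* assumed maximum for 512-byte bulk packets */
#define HS_BULK_MAX_PACKET          512L /* bytes per high-speed bulk packet */

/* Measured rate in bytes per second. */
double measured_bytes_per_sec(double total_bytes, double seconds)
{
    return total_bytes / seconds;
}

/* Theoretical ceiling for high-speed bulk traffic in bytes per second. */
long hs_bulk_ceiling_bytes_per_sec(void)
{
    return HS_MICROFRAMES_PER_SEC * HS_BULK_PKTS_PER_MICROFRAME * HS_BULK_MAX_PACKET;
}
```

With 10,240,000 bytes in 25 seconds the measured rate is about 0.41 MB/s, against a ceiling of roughly 53 MB/s, so the link is running at well under 1% of what high-speed bulk can carry.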

I am using an EZ-USB FX2 development kit from Cypress as the target board. It is running the bulkloop firmware, which sets up two in and two out endpoints. Each endpoint is double buffered and supports 512-byte windows. The firmware polls the out endpoints in a while(1) loop in main(), with no sleep, and uses autopointers to copy data from the out endpoints to the in endpoints whenever data is available. I have been told that this can move data fairly quickly on Windows using their specific application, but I have not had a chance to verify this.

My code (relevant portions below) calls a function called bulk_io in the device probe routine. This function creates a number (URB_SETS) of out urbs, each of which attempts to write 512 bytes to the device. Changing this number between 1 and 32 doesn't change performance. They all copy from the same buffer. The callback handler for each write to an out endpoint creates a read urb on the corresponding in endpoint. The read callback then creates another write urb, and so on until I have hit the total number of write/read requests that I want to run (20,000).

I am now working on pushing most of the operations in the callback functions into bottom halves, in case they are blocking other interrupts. I am also thinking of rewriting the bulkloop firmware for the Cypress FX2 to use interrupts instead of polling.

Is there anything here that looks out of the ordinary and would explain such low performance? Thank you in advance. Please let me know if you would like to see more code; this is just a bare-bones driver to test I/O to the Cypress FX2.

This is the out endpoint write callback function:

static void bulk_io_out_callback0(struct urb *t_urb) {
    // will need to make this work with bottom half
    struct usb_dev_stat *uds = t_urb->context;
    // completion handlers run in atomic context, so use GFP_ATOMIC, not GFP_KERNEL
    struct urb *urb0 = usb_alloc_urb(0,GFP_ATOMIC);
    if (urb0 == NULL) {
            printk("bulk_io_out_callback0: out of memory!\n");
            return; // nothing to fill or submit
    }
    // queue the matching read on the in endpoint
    usb_fill_bulk_urb(urb0, interface_to_usbdev(uds->intf), usb_rcvbulkpipe(uds->udev,uds->ep_in[0]), uds->buf_in, uds->max_packet, bulk_io_in_callback0, uds);
    usb_submit_urb(urb0,GFP_ATOMIC);
    usb_free_urb(urb0); // drops our reference; the USB core keeps its own until completion
}

This is the in endpoint read callback function:

static void bulk_io_in_callback0(struct urb *t_urb) {
    struct usb_dev_stat *uds = t_urb->context;
    struct urb *urb0;

    if (uds->seq--) {
            // completion handlers run in atomic context, so use GFP_ATOMIC, not GFP_KERNEL
            urb0 = usb_alloc_urb(0,GFP_ATOMIC);
            if (urb0 == NULL) {
                    printk("bulk_io_in_callback0: out of memory!\n");
                    return;
            }
            // queue the next write on the out endpoint
            usb_fill_bulk_urb(urb0, interface_to_usbdev(uds->intf), usb_sndbulkpipe(uds->udev,uds->ep_out[0]), uds->buf_out, uds->max_packet, bulk_io_out_callback0, uds);
            usb_submit_urb(urb0,GFP_ATOMIC);
            usb_free_urb(urb0); // drops our reference; the USB core keeps its own until completion
    }
    else {
            uds->t1 = get_seconds();
            uds->buf_in[8] = 0; // terminate after the first 8 chars so only those are printed below
            printk("bulk_io_in_callback0: completed, time=%lds, bytes=%d, data=%s\n", (uds->t1-uds->t0), uds->max_packet*SEQ, uds->buf_in);
    }
}

This function gets called to set up the initial urbs:

static int bulk_io (struct usb_interface *interface, struct usb_dev_stat *uds) {
    struct urb *urb0;
    int i;
    int ret;

    uds->t0 = get_seconds();

    memcpy(uds->buf_out,"abcd1234",8);

    uds->seq = SEQ; // how many times we will run this

    printk("bulk_io: starting up the stream, seq=%ld\n", uds->seq);

    for (i = 0; i < URB_SETS; i++) {
            // called from probe (process context), so GFP_KERNEL is fine here
            urb0 = usb_alloc_urb(0,GFP_KERNEL);
            if (urb0 == NULL) {
                    printk("bulk_io: out of memory!\n");
                    return -ENOMEM;
            }

            usb_fill_bulk_urb(urb0, interface_to_usbdev(uds->intf), usb_sndbulkpipe(uds->udev,uds->ep_out[0]), uds->buf_out, uds->max_packet, bulk_io_out_callback0, uds);
            ret = usb_submit_urb(urb0,GFP_KERNEL);
            printk("bulk_io: submitted urb, status=%d\n", ret);
            usb_free_urb(urb0); // we don't need this anymore
    }

    return 0;
}

Edit 1: I verified that udev->speed == 3, i.e. USB_SPEED_HIGH, so this is not because Linux thinks this is a slow device.

Edit 2: I moved everything in the callbacks related to urb creation (kmalloc, submit) and freeing into bottom halves, with the same performance.

So the mystery is no more. I modified the CY7C68013A 'bulkloop' firmware to toggle a GPIO while it is moving data/arming endpoints, and it was spending ~80% of its cycles in that function. It looks like having the 8051 core touch the USB buffers at all reduces throughput to ~0.5MB/s, as shown above. I went ahead and benchmarked with their CyUSB lib Windows bulkloop demo and got much worse performance, around 0.1MB/s. In conclusion, the bulkloop firmware is not a good test of USB driver performance. Will try it next with an FPGA feeding the CY7C68013A data. - armguy
Just a small point here: you should keep URB callbacks as small as possible (e.g. just setting a flag). - arash kordi

1 Answer


In my experience, reading and writing in small chunks is not very efficient.

I am using an EZ-USB FX2 development kit from Cypress as the target board. It is running the bulkloop firmware which sets up two in and two out endpoints. Each endpoint is double buffered and supports 512-byte windows.

This does not mean that you can write no more than 512 bytes to it at a time.

I would try writing no less than 4096 bytes to it at a time, because that is the standard page size (perhaps not so standard on embedded systems). If that worked, I would try writing as much as 1/4 of a megabyte at a time, and then even more if that worked.
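To put numbers on that, here is a minimal sketch of how the count of separate submissions drops as the chunk size grows. It assumes the total is the 10,240,000 bytes implied by the question (512 bytes x 20,000 requests) and that per-submission overhead is what dominates, which is an assumption and not something measured here. The function name is made up for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of submissions needed to move total_bytes
 * when each submission carries at most chunk_bytes (rounded up).
 * chunk_bytes must be non-zero.
 */
size_t transfers_needed(size_t total_bytes, size_t chunk_bytes)
{
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

For 10,240,000 bytes this gives 20,000 submissions at 512 bytes, 2,500 at 4096 bytes, and 40 at 1/4 of a megabyte (262,144 bytes).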

The key point here is knowing when the write window of the device is full. When it is, the device will invoke a callback, or you learn about it by some other means, and you use that to signal your application to stop writing.

Note that the window will not be full after you "give the device" 512 bytes, because the device will start reading from this window as soon as there is anything to read.

Perhaps I've missed something important in your question, but what I'm saying is essentially that you have to write more than 512 bytes at a time. This is why you get such poor performance.