0
votes

I want to capture every frame of a video and make some modification before it is rendered on an Android device, such as the Nexus 10. As I understand it, Android uses device-specific hardware to decode and render frames, so I should get the frame data from the GraphicBuffer, and before rendering the data will be in YUV format.

I also wrote a static method in AwesomePlayer.cpp that captures the frame data, modifies the frame, and writes it back into the GraphicBuffer for rendering.

Here is my demo code:

static void handleFrame(MediaBuffer *buffer) {

    sp<GraphicBuffer> buf = buffer->graphicBuffer();

    size_t width = buf->getWidth();
    size_t height = buf->getHeight();
    size_t ySize = buffer->range_length();
    size_t uvSize = width * height / 2;

    uint8_t *yBuffer = (uint8_t *)malloc(ySize);
    uint8_t *uvBuffer = (uint8_t *)malloc(uvSize);
    memset(yBuffer, 0, ySize);
    memset(uvBuffer, 0, uvSize);

    int const *private_handle = buf->handle->data;

    void *yAddr = NULL;
    void *uvAddr = NULL;

    // Lock the Y plane for CPU access; the UV plane lives behind a
    // second fd in the native handle, so map it directly.
    buf->lock(GRALLOC_USAGE_SW_READ_OFTEN | GRALLOC_USAGE_SW_WRITE_OFTEN, &yAddr);
    uvAddr = mmap(0, uvSize, PROT_READ | PROT_WRITE, MAP_SHARED,
                  *(private_handle + 1), 0);

    if (yAddr != NULL && uvAddr != MAP_FAILED) {

      // Copy data out of the graphic buffer
      memcpy(yBuffer, yAddr, ySize);
      memcpy(uvBuffer, uvAddr, uvSize);

      // ... modify the YUV data here ...

      // Copy the data back into the graphic buffer
      memcpy(yAddr, yBuffer, ySize);
      memcpy(uvAddr, uvBuffer, uvSize);
    }

    if (uvAddr != MAP_FAILED)
        munmap(uvAddr, uvSize);
    buf->unlock();

    free(yBuffer);
    free(uvBuffer);
}

I printed timestamps around the memcpy calls and realized that copying from the GraphicBuffer takes much longer than copying data back into it. Taking a 1920x1080 video as an example, the memcpy from the GraphicBuffer takes about 30 ms, which is unacceptable for normal video playback.

I have no idea why it takes so long. Perhaps it is copying the data out of a GPU buffer, but copying data into the GraphicBuffer looks normal.

Could anyone who is familiar with hardware decoding on Android take a look at this issue? Thanks very much.

Update: I found that I didn't have to use a GraphicBuffer to get the YUV data. I just used the hardware decoder on the video source and stored the YUV data in memory, so I could read the YUV data from memory directly, which is very fast. You can find a similar solution in the AOSP source code and in open-source video player apps: I just allocate memory buffers rather than graphic buffers and then use the hardware decoder. Sample code in AOSP: frameworks/av/cmds/stagefright/SimplePlayer.cpp

link: https://github.com/xdtianyu/android-4.2_r1/tree/master/frameworks/av/cmds/stagefright

1
It might be because it's just a lot of data. 1920x1080 is about 2.1 MB per frame at just 8 bits/pixel; with full ARGB, expect that to grow to about 8.3 MB per frame. Even at 15 frames per second, you're asking your tablet to move roughly 124 MB per second. That's a lot of data to move around and draw on the screen, push to a file, or whatever you're doing with it. – Martin
Thanks for your reply @Martin; maybe I didn't describe my intention clearly. First, I only deal with the YUV data and write it back into the original buffer after modification. Also, I tested memcpy of the same amount of data between buffers I allocated myself, and it takes very little time, just about 5 ms for the same video. So I assume it takes so long because it copies data from the GPU, but then it makes no sense that copying the data back into the GPU takes little time, about 6 or 7 ms. It's confusing. – NicotIne
Copying data from a graphics unit is usually an order of magnitude slower, due to the caching model in use. On some platforms (by which I mean the graphics device plus system board) you may not even be able to copy it back to system memory. – Non-maskable Interrupt
Thanks @Calvin, I don't know much about the hardware-level mechanisms, so I couldn't find the root cause. But I did try copying the frame data (YUV) to system memory (like the buffers above, allocated with malloc()), modifying the data, and copying it back into the original GraphicBuffer; it worked well on the Nexus 10 except that it dropped a lot of frames due to the long delay. About the caching model, could you tell me more? I have no idea about that. Thanks again. – NicotIne
If you can program the logic using shaders, you may keep everything in video memory and avoid all the transfers. – Non-maskable Interrupt

1 Answer

2
votes

Most likely, the data path (a.k.a. the data bus) from your CPU to graphics memory is optimized, while the path from graphics memory back to the CPU is not. Optimizations may include a faster internal data bus, level 1 or 2 caches, and fewer wait states.

The electronics (hardware) set the maximum speed for transferring data from graphics memory to your CPU. The CPU's memory is probably slower than your graphics memory, so wait states may be inserted so that the graphics memory matches the slower speed of the CPU memory.

Another issue is all the devices sharing the data bus. Imagine a shared highway between cities where traffic is only allowed in one direction at a time, controlled by traffic signals or a traffic director. To go from City A to City C, one has to wait until the signals or director clear the remaining traffic and give the route from City A to City C priority. In hardware terms, this is called bus arbitration.

On most platforms, the CPU transfers data between its registers and CPU memory; this is what reading and writing the variables in your program requires. The slow route is for the CPU to read memory into a register, then write that register to graphics memory. A more efficient method is to transfer the data without involving the CPU at all. There may be a DMA (Direct Memory Access) device that can do this: you tell it the source and target memory locations and start it, and it transfers the data without using the CPU.

Unfortunately, the DMA controller must share the data bus with the CPU. This means your transfer will be slowed by any requests for the bus from the CPU. It will still be faster than using the CPU to copy the data, because the DMA can transfer data while the CPU executes instructions that don't require the data bus.

Summary
Your memory transfers may be slow if you don't have a DMA device. With or without DMA, the data bus is shared by multiple devices and traffic is arbitrated; this sets the maximum overall transfer speed. The transfer speeds of the memory chips themselves also contribute. Hardware-wise, there is a speed limit.

Optimizations
1. Use the DMA, if possible.
2. If only using CPU, have CPU transfer the largest chunks possible.
This means using instructions specifically for copying memory.
3. If your CPU doesn't have specialized copy instructions, transfer using the word size of the processor.
If the processor has 32-bit words, transfer 4 bytes at a time with 1 word rather than using 4 8-bit copies.
4. Reduce CPU demands and interruptions during the transfer.
Pause any applications; disable interrupts if possible.
5. Divide the effort: Have one core transfer the data while another core is executing your program.
6. Threading on a single core may actually slow the transfer, as the OS gets involved because of scheduling. The thread switching takes time which adds to the transfer time.