
We have a server application that relays files from clientA to clientB, clientC, clientD, etc. We call this kind of file relay a task. When many tasks execute concurrently, the CPU usage becomes very high.

I wonder whether such high CPU usage is normal when executing multiple tasks concurrently. Is there any method to decrease the CPU usage in this kind of application?

     //pseudo code
     void service(void){
          while(1){
               ....
               struct timeval timeout;
               timeout.tv_sec = 3;

               ...
               ret = select(maxFd+1, &read_set, NULL, NULL, &timeout);
               if (ret > 0){
                    //get the socket from SocketsMap;
                    //if fd is in SocketsMap and its bit is set in read_set,
                    //then receive data from the socket
                    //(all fds are in non-blocking mode)
                    receive_data(fd);
               }
          }
     }

     void receive_data(int fd){
          const int ONE_MEGA = 1024 * 1024;
          char *buffer = new char[ONE_MEGA];
          int readn = recv(fd, buffer, ONE_MEGA, 0);

          //handle the data: many steps
          char* DataToProcess = buffer;
          int LenToProcess = readn;
          while(LenToProcess > 0){
               //1. scan the data to find the packet header
               //2. get the length from the packet, then run a checksum
               //   function that scans every byte of the packet to
               //   compute a checksum value
               //3. if the packet is valid, add it to the data queue,
               //   then advance the buffer pointer and process the
               //   remaining data
               ......
               LenToProcess -= DataProcessed;
               DataToProcess += DataProcessed;
          }
          delete [] buffer; //note: missing in the original, which leaked 1 MB per call
     }

As you can see, all three steps in receive_data() are CPU-intensive operations. Is there any method to decrease the CPU usage as much as possible for this kind of operation (other than setting a very small buffer size such as "char buffer[1024]")?

The problem here is that our application runs alongside another server application on the same machine, so the FileRelayer application can't consume too much CPU; otherwise the other server application won't work normally.

[UPDATE]
Here are some pieces of information about the application:
A. There are about 70 threads in this multithreaded FileServer application, but only one of them is used to receive data from all sockets.
B. All the sockets are in non-blocking mode, including the listening socket.
C. High CPU usage (80%-90%) is observed while the application receives four 200 MB files from 4 clients (4 sockets).

Regarding the problem:
We separate the whole receiving flow into two major parts; let's call them FlowA and FlowB. FlowA only receives the data from the sockets. FlowB stands for the part of receive_data() that handles the data, such as packet slicing. We found that FlowA and FlowB each cause high CPU usage.

1) FlowA: a big array (1 MB) allocated on the stack, as described in this post. In our test we kept only FlowA (discarding the data right after receiving it from the sockets) and found that the CPU usage stayed as high as 80%-90% for a long time. Replacing the "char buffer[ONE_MEGA]" with "char *buffer = new char[ONE_MEGA]" decreased the CPU usage to 14%.
2) FlowA + FlowB: after we solved the issue in FlowA, we found the CPU usage was still as high as 80% in the whole flow (FlowA + FlowB), though it fluctuated this time.

Setting the receiving buffer to a very small one such as char buffer[1024] decreases the CPU usage dramatically, because each function call processes only one or two packets, but we worry that the transfer speed will also decrease. So is there any other way to solve this problem?

High CPU usage under high loads is actually good. Unless the processing algorithm is painfully inefficient, high usage means little waiting time. - Sergey Kalinichenko
@dasblinkenlight thanks for pointing out the essence of CPU usage. I updated the article. - Wallace
Find out what part of your code is actually consuming more time with the bigger buffer. You can for example use oprofile or perftool on Linux to identify bottlenecks in the code. - Mats Petersson
if you have many of these "tasks" running, CPU load will also increase because each of them accesses memory at very different locations, which can cause cache thrashing, lowering overall performance while keeping CPU load high. In that case, the CPU waits for cache lines to be filled or written back during read/write operations. Hence your application will scale sublinearly, and overall performance can be worse than with a lower number of tasks. - ogni42
@ogni42: Good point. Is the reason for using such a large buffer to simply make it "more efficient"? Then I would definitely try to use a block-size around 4-16KB. That should give you a decent packet size for each fetch from the network, but not so large that it goes way outside the caches - even with a few threads, that should keep it within at least the L2 cache on a modern "big" processor. - Mats Petersson

3 Answers

0
votes

For TCP sockets, the function receive_data may not work correctly.

The fact that it allocates a new local buffer suggests that this buffer gets destroyed when the function returns. This implies that receive_data cannot handle incomplete messages.

A correct approach is to allocate a buffer for each socket once. Read from the socket into that buffer, then process and discard complete messages at the front of the buffer. Once all complete messages have been consumed, move the tail of the buffer containing the incomplete message to the front; the next time the socket is ready for reading, append the new bytes to the end of the incomplete message until it is complete.
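The approach above can be sketched as follows (the SocketBuffer name and the 1-byte length-prefix framing are illustrative assumptions, not something specified in the question):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One of these lives alongside each socket for the socket's whole lifetime.
struct SocketBuffer {
    std::string pending; // bytes received but not yet consumed

    // Append freshly received bytes (whatever recv() just returned).
    void append(const char* data, std::size_t len) { pending.append(data, len); }

    // Extract every complete message from the front of the buffer;
    // an incomplete tail is kept for the next recv().
    // Assumed framing: a 1-byte length prefix followed by the payload.
    std::vector<std::string> extract_messages() {
        std::vector<std::string> out;
        std::size_t pos = 0;
        while (pending.size() - pos >= 1) {
            std::size_t msg_len = static_cast<unsigned char>(pending[pos]);
            if (pending.size() - pos - 1 < msg_len)
                break; // incomplete message: wait for more bytes
            out.emplace_back(pending, pos + 1, msg_len);
            pos += 1 + msg_len;
        }
        pending.erase(0, pos); // move the incomplete tail to the front
        return out;
    }
};
```

Because the partial message survives between calls, a packet split across two recv() calls is reassembled instead of being dropped.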

0
votes

Yes. The CPU shouldn't be doing much work. The only thing you're really doing is copying bytes around, and that is unnecessary.

0
votes

To illustrate the example with caches, I took my answer to the previous question on similar subject, and added a check-summing piece of code:

#include <iostream>
#include <cstdio>

using namespace std;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

const int M = 1024*1024;
const int S = 8*1024;

void bigstack()
{
    FILE *f = fopen("test.txt", "r");
    unsigned long long time;
    time = rdtsc();
    char buffer[M];

    fread(buffer, M, 1, f);
    int csum = 0;
    for(char i : buffer)
    {
        csum += i;
    }
    time = rdtsc() - time;
    fclose(f);
    cout << "bs: Time = " << time / 1000 << " csum=" << csum << endl;
}


void bigheap()
{
    FILE *f = fopen("test.txt", "r");
    unsigned long long time;
    time = rdtsc();
    char *buffer = new char[M];

    fread(buffer, M, 1, f);
    int csum = 0;
    for(int i = 0; i < M; i++)
    {
        csum += buffer[i];
    }
    delete [] buffer;
    time = rdtsc() - time;
    fclose(f);
    cout << "bh: Time = " << time / 1000 << " csum=" << csum << endl;
}


void smallstack()
{
    FILE *f = fopen("test.txt", "r");
    unsigned long long time;
    time = rdtsc();
    char buffer[S];
    int toread = M;

    int csum = 0;
    while(toread > 0)
    {
        fread(buffer, S, 1, f);
        for(char i : buffer)
        {
            csum += i;
        }
        toread -= S;
    }
    time = rdtsc() - time;
    fclose(f);
    cout << "ss: Time = " << time / 1000 << " csum=" << csum << endl;
}


int main()
{
    for(int i = 0; i < 10; i++)
    {
        bigstack();
        bigheap();
        smallstack();
    }
}

So, now the code is reading the data into the CPU, and then walking through it all. The time it takes to do the large block is about 16% higher than it is for the smaller block. As can be seen below, the time for the large block is around 1400 units of time, and the smaller blocksize, even though it calls fread multiple times, is around 1200 units of time.

Here's a cut-down version of the output:

bs: Time = 1417 csum=89411462
bh: Time = 1428 csum=89411462
ss: Time = 1208 csum=89411462
bs: Time = 1444 csum=89411462
bh: Time = 1415 csum=89411462
ss: Time = 1205 csum=89411462
bs: Time = 1463 csum=89411462
bh: Time = 1409 csum=89411462
ss: Time = 1262 csum=89411462
bs: Time = 1418 csum=89411462
bh: Time = 1441 csum=89411462
ss: Time = 1213 csum=89411462

The reason for this is that the large block will "fight" more with other data items to fit in the CPU cache, so it's slower.