2 votes

I am now trying to handle a large file (several GB), so I am thinking of using multiple threads. The file consists of multiple lines of data like:

data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 attr2.2 attr2.3
data3 attr3.1 attr3.2 attr3.3

I am thinking of using one thread to read multiple lines into buffer1 first, and then another thread to handle the data in buffer1 line by line, while the reading thread starts reading the file into buffer2. Then the handling thread continues when buffer2 is ready, and the reading thread reads into buffer1 again.

I have now finished the handler part using fread for small files (several KB), but I am not sure how to make the buffer contain only complete lines instead of splitting a line at the end of the buffer, like this:

data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 att

Also, I see that fgets or ifstream getline can read the file line by line, but would that be very costly since it does many I/Os?

Now I am struggling to figure out what the best way to do this is. Is there an efficient way to read multiple lines at a time? Any advice is appreciated.

Read in a bunch of bytes into a buffer. From the end of that buffer, search backwards until you find a newline. That's the last full line read in. Add a special case for end-of-file and that should about cover it. – Cameron
Also, the OS will be doing all sorts of fancy async read-aheads for you. You really don't gain much by having one thread do the read and another doing the parsing. Of course, the real test is to build several different designs and measure the performance. – pm100
Did you take into account the approach of loading everything into a database, and then doing the processing on the data stored inside the database? Or alternatively, take a two-pass approach with the first pass building an index storing all lines' offsets. – alk

2 Answers

1 vote

C stdio and C++ iostream functions use buffered I/O. Small reads only have function-call and locking overhead, not read(2) system call overhead.

Without knowing the line length ahead of time, fgets has to either use a buffer or read one byte at a time. Luckily, the C/C++ I/O semantics allow it to use buffering, so every mainstream implementation does. (According to the docs, mixing stdio and I/O on the underlying file descriptors gives undefined results. This is what allows buffering.)

You're right that it would be a problem if every fgets required a system call.
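
If profiling shows the default stdio buffer is too small for your access pattern, you can hand the stream a larger one with setvbuf before the first read. A minimal sketch, assuming a hypothetical input file name and an arbitrary 1 MiB buffer size:

#include <cstdio>
#include <vector>

int main()
{
    std::FILE *f = std::fopen("data.txt", "r");  // hypothetical input file
    if (!f) return 1;

    // Give stdio a larger buffer so each underlying read fetches more data at once.
    // The buffer must stay alive as long as the stream is open.
    std::vector<char> iobuf(1 << 20);            // 1 MiB: size chosen arbitrarily
    std::setvbuf(f, iobuf.data(), _IOFBF, iobuf.size());

    char line[4096];
    while (std::fgets(line, sizeof line, f)) {
        // process one line; fgets itself only touches the user-space buffer
    }
    std::fclose(f);
}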


You might find it useful to have one thread read lines and put them into some kind of data structure (a queue of line batches, say) for the processing thread to consume.

If you don't have to do much processing on each line, though, doing the I/O in the same thread as the processing keeps everything in the L1 cache of that CPU. Otherwise the data ends up in L1 of the core running the I/O thread, and then has to make its way to L1 of the core running the processing thread.
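
If you do decide to split the work, here is a minimal sketch of one way to hand the data over, roughly matching the buffer1/buffer2 idea from the question: the reader thread collects batches of whole lines and pushes them through a mutex-protected queue to the processing thread. The file name, the batch size of 1000, and the unbounded queue (no backpressure) are all simplifying assumptions, not a tuned design.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::queue<std::vector<std::string>> chunks;   // batches of whole lines
bool done = false;

void reader(const char *path)
{
    std::ifstream in(path);
    std::vector<std::string> batch;
    for (std::string line; std::getline(in, line); ) {
        batch.push_back(std::move(line));
        if (batch.size() == 1000) {            // hand off a full batch
            std::lock_guard<std::mutex> lk(m);
            chunks.push(std::move(batch));
            batch.clear();
            cv.notify_one();
        }
    }
    std::lock_guard<std::mutex> lk(m);
    if (!batch.empty()) chunks.push(std::move(batch));
    done = true;                               // tell the consumer we're finished
    cv.notify_one();
}

void processor()
{
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !chunks.empty() || done; });
        if (chunks.empty()) return;            // reader finished and nothing left
        std::vector<std::string> batch = std::move(chunks.front());
        chunks.pop();
        lk.unlock();                           // process outside the lock
        for (const std::string &line : batch) {
            (void)line;                        // TO DO: handle one line here
        }
    }
}

int main()
{
    std::thread t1(reader, "data.txt"), t2(processor);
    t1.join();
    t2.join();
}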


Depending on what you want to do with your data, you can minimize copying by memory-mapping the file in place. Or read it with fread, or skip the stdio layer entirely and just use POSIX open / read, if you don't need your code to be as portable. Scanning a buffer for newlines might have less overhead than what the stdio functions do.
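
For example, a rough sketch of the memory-map-and-scan approach on POSIX systems, assuming a hypothetical file name and with error handling kept to a minimum:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
    int fd = open("data.txt", O_RDONLY);       // hypothetical input file
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // Map the whole file read-only; the kernel pages it in as you touch it.
    char *base = static_cast<char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (base == MAP_FAILED) return 1;

    long lines = 0, bytes = 0;
    const char *p = base, *end = base + st.st_size;
    while (p != end) {
        const char *nl = static_cast<const char *>(memchr(p, '\n', end - p));
        const char *line_end = nl ? nl : end;  // last line may lack a newline
        bytes += line_end - p;                 // stand-in for per-line processing on [p, line_end)
        ++lines;
        p = nl ? nl + 1 : end;
    }
    std::printf("%ld lines, %ld bytes of line data\n", lines, bytes);

    munmap(base, st.st_size);
    close(fd);
}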

You can handle the leftover line at the end of the buffer by copying it to the front of the buffer, and calling the next fread with a reduced buffer size. (Or, make your buffer ~1k bigger than the size of your fread calls, so you can always read multiples of the memory and filesystem page size (typically 4kiB), unless the trailing part of the line is > 1k.)

Or use a circular buffer, but reading from a circular buffer means checking for wraparound every time you touch it.

0 votes

It all depends on what processing you want to do afterwards: do you need to keep a copy of the lines? Do you intend to process the input as std::strings? etc.

Here are some general remarks that could help you further:

  • istream::getline() and fgets() are buffered operations, so the number of actual I/O system calls is already reduced and you can assume their performance is already reasonable.

  • std::getline() is also buffered. Nevertheless, if you don't actually need the data as std::strings, the function costs you a considerable number of memory allocations/deallocations, which might impact performance (see the sketch after this list for one way to reduce them).

  • Block operations like read() or fread() can achieve economies of scale if you can afford large buffers. This can be especially efficient if you use the data in a throw-away fashion (because you can avoid copying it and can work directly in the buffer), but it comes at the cost of extra complexity.
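
To illustrate the point about std::getline(): reusing a single std::string across calls keeps its capacity, so after the first few lines it rarely needs to allocate again. A minimal sketch, assuming a hypothetical file name:

#include <cstdio>
#include <fstream>
#include <string>

int main()
{
    std::ifstream in("data.txt");   // hypothetical input file
    long lines = 0, chars = 0;

    // The same std::string is reused every iteration, so its buffer grows once
    // to the longest line seen and is then recycled instead of reallocated.
    std::string line;
    while (std::getline(in, line)) {
        ++lines;
        chars += line.size();       // stand-in for real processing
    }
    std::printf("%ld lines, %ld chars\n", lines, chars);
}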

But all these considerations must not make you forget that performance is very much affected by the library implementation that you use.

I've done a little informal benchmark reading a million lines in the format you've shown:

  • With MSVC2015 on my PC, read() is twice as fast as fgets(), and almost 4 times faster than std::getline() into a std::string.

  • With GCC on CodingGround, compiling with -O3, fgets() and both getline() variants are approximately the same, and read() is slower.

Here is the full code if you want to play around.

Here is the code that shows you how to move the buffer around.

// Assumes an open std::ifstream ifs, a char buffer[] of szb+1 bytes (the spare
// byte lets the final line be null-terminated), and counters lines and sln.
int nr = 0;        // number of bytes kept from an incomplete line
bool last = false; // true once the final (short) read has happened
while (!last)
{
    // read after the carried-over bytes; a short read means end of file
    last = !ifs.read(buffer + nr, szb - nr);
    nr = nr + ifs.gcount();
    char *s = buffer, *p = buffer, *pe = buffer + nr;
    while (p != pe) {                 // process complete lines in the buffer
        for (s = p; p != pe && *p != '\n'; p++)
            ;
        if (p != pe) {                // found a newline: s is a complete line
            *p++ = '\0';
            lines++;                  // TO DO: here s is a null-terminated line to process
            sln += strlen(s);         // (dummy operation for the example)
            s = p;                    // nothing to carry over from this line
        } else if (last && p != s) {  // final line of the file has no trailing newline
            *p = '\0';                // uses the spare byte of the buffer
            lines++;
            sln += strlen(s);
            s = p;
        }
    }
    std::copy(s, pe, buffer);  // copy the last (incomplete) line to the front of the buffer
    nr = pe - s;               // and prepare the count for the next iteration
    // note: a single line longer than szb is not handled by this sketch
}