2 votes

I am now trying to handle a large file (several GB), so I am thinking of using multiple threads. The file consists of multiple lines of data like:

data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 attr2.2 attr2.3
data3 attr3.1 attr3.2 attr3.3

I am thinking of using one thread to read multiple lines into buffer1 first, and then another thread to handle the data in buffer1 line by line, while the reading thread starts reading the file into buffer2. Then the handling thread continues when buffer2 is ready, and the reading thread reads into buffer1 again.

I have now finished the handler part using fread for small files (several KB), but I am not sure how to make the buffer contain only complete lines instead of splitting a line at the end of the buffer, like this:

data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 att

Also, I see that fgets or ifstream getline can read the file line by line, but would that be very costly since it does many I/Os?

Now I am struggling to figure out what the best way to do this is. Is there an efficient way to read multiple lines at a time? Any advice is appreciated.

Read in a bunch of bytes into a buffer. From the end of that buffer, search backwards until you find a newline. That's the last full line read in. Add a special case for end-of-file and that should about cover it. – Cameron
Also, the OS will be doing all sorts of fancy async read-aheads for you. You really don't gain much by having one thread do the read and another doing the parsing. Of course, the real test is to build several different designs and measure the performance. – pm100
Did you take into account the approach of loading everything into a database, and then doing the processing on the data stored inside the database? Or alternatively, take a two-pass approach with the first pass building an index storing all lines' offsets. – alk

2 Answers

1 vote

C stdio and C++ iostream functions use buffered I/O. Small reads only have function-call and locking overhead, not read(2) system call overhead.

Without knowing the line length ahead of time, fgets has to either use a buffer or read one byte at a time. Luckily, the C/C++ I/O semantics allow it to use buffering, so every mainstream implementation does. (According to the docs, mixing stdio and I/O on the underlying file descriptors gives undefined results. This is what allows buffering.)

You're right that it would be a problem if every fgets required a system call.
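
If profiling shows the default stdio buffer is too small for your access pattern, you can hand the stream a larger one with setvbuf before the first read. A minimal sketch, assuming a hypothetical input file name and an arbitrary 1 MiB buffer size:

#include <cstdio>
#include <vector>

int main()
{
    std::FILE *f = std::fopen("data.txt", "r");  // hypothetical input file
    if (!f) return 1;

    // Give stdio a larger buffer so each underlying read fetches more data at once.
    // The buffer must stay alive as long as the stream is open.
    std::vector<char> iobuf(1 << 20);            // 1 MiB: size chosen arbitrarily
    std::setvbuf(f, iobuf.data(), _IOFBF, iobuf.size());

    char line[4096];
    while (std::fgets(line, sizeof line, f)) {
        // process one line; fgets itself only touches the user-space buffer
    }
    std::fclose(f);
}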


You might find it useful to have one thread read lines and put them into some kind of data structure (a queue of line batches, say) for the processing thread to consume.

If you don't have to do much processing on each line, though, doing the I/O in the same thread as the processing keeps everything in the L1 cache of that CPU. Otherwise the data ends up in L1 of the core running the I/O thread, and then has to make its way to L1 of the core running the processing thread.
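
If you do decide to split the work, here is a minimal sketch of one way to hand the data over, roughly matching the buffer1/buffer2 idea from the question: the reader thread collects batches of whole lines and pushes them through a mutex-protected queue to the processing thread. The file name, the batch size of 1000, and the unbounded queue (no backpressure) are all simplifying assumptions, not a tuned design.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::queue<std::vector<std::string>> chunks;   // batches of whole lines
bool done = false;

void reader(const char *path)
{
    std::ifstream in(path);
    std::vector<std::string> batch;
    for (std::string line; std::getline(in, line); ) {
        batch.push_back(std::move(line));
        if (batch.size() == 1000) {            // hand off a full batch
            std::lock_guard<std::mutex> lk(m);
            chunks.push(std::move(batch));
            batch.clear();
            cv.notify_one();
        }
    }
    std::lock_guard<std::mutex> lk(m);
    if (!batch.empty()) chunks.push(std::move(batch));
    done = true;                               // tell the consumer we're finished
    cv.notify_one();
}

void processor()
{
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !chunks.empty() || done; });
        if (chunks.empty()) return;            // reader finished and nothing left
        std::vector<std::string> batch = std::move(chunks.front());
        chunks.pop();
        lk.unlock();                           // process outside the lock
        for (const std::string &line : batch) {
            (void)line;                        // TO DO: handle one line here
        }
    }
}

int main()
{
    std::thread t1(reader, "data.txt"), t2(processor);
    t1.join();
    t2.join();
}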


Depending on what you want to do with your data, you can minimize copying by memory-mapping the file in place. Or read it with fread, or skip the stdio layer entirely and just use POSIX open / read, if you don't need your code to be as portable. Scanning a buffer for newlines might have less overhead than what the stdio functions do.
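
For example, a rough sketch of the memory-map-and-scan approach on POSIX systems, assuming a hypothetical file name and with error handling kept to a minimum:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
    int fd = open("data.txt", O_RDONLY);       // hypothetical input file
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // Map the whole file read-only; the kernel pages it in as you touch it.
    char *base = static_cast<char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (base == MAP_FAILED) return 1;

    long lines = 0, bytes = 0;
    const char *p = base, *end = base + st.st_size;
    while (p != end) {
        const char *nl = static_cast<const char *>(memchr(p, '\n', end - p));
        const char *line_end = nl ? nl : end;  // last line may lack a newline
        bytes += line_end - p;                 // stand-in for per-line processing on [p, line_end)
        ++lines;
        p = nl ? nl + 1 : end;
    }
    std::printf("%ld lines, %ld bytes of line data\n", lines, bytes);

    munmap(base, st.st_size);
    close(fd);
}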

You can handle the leftover line at the end of the buffer by copying it to the front of the buffer, and calling the next fread with a reduced buffer size. (Or, make your buffer ~1k bigger than the size of your fread calls, so you can always read multiples of the memory and filesystem page size (typically 4kiB), unless the trailing part of the line is > 1k.)

Or use a circular buffer, but reading from a circular buffer means checking for wraparound every time you touch it.

0 votes

It all depends on what processing you want to do afterwards: do you need to keep a copy of the lines? Do you intend to process the input as std::strings? etc.

Here are some general remarks that could help you further:

  • istream::getline() and fgets() are buffered operations, so the number of actual I/O system calls is already reduced and you can assume their performance is already reasonable.

  • std::getline() is also buffered. Nevertheless, if you don't actually need the data as std::strings, the function costs you a considerable number of memory allocations/deallocations, which might impact performance (see the sketch after this list for one way to reduce them).

  • Block operations like read() or fread() can achieve economies of scale if you can afford large buffers. This can be especially efficient if you use the data in a throw-away fashion (because you can avoid copying it and can work directly in the buffer), but it comes at the cost of extra complexity.
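
To illustrate the point about std::getline(): reusing a single std::string across calls keeps its capacity, so after the first few lines it rarely needs to allocate again. A minimal sketch, assuming a hypothetical file name:

#include <cstdio>
#include <fstream>
#include <string>

int main()
{
    std::ifstream in("data.txt");   // hypothetical input file
    long lines = 0, chars = 0;

    // The same std::string is reused every iteration, so its buffer grows once
    // to the longest line seen and is then recycled instead of reallocated.
    std::string line;
    while (std::getline(in, line)) {
        ++lines;
        chars += line.size();       // stand-in for real processing
    }
    std::printf("%ld lines, %ld chars\n", lines, chars);
}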

But all these considerations must not make you forget that performance is very much affected by the library implementation that you use.

I've done a little informal benchmark reading a million lines in the format you've shown:

  • With MSVC2015 on my PC, read() is twice as fast as fgets(), and almost 4 times faster than std::getline() into a std::string.

  • With GCC on CodingGround, compiling with -O3, fgets() and both getline() variants are approximately the same, and read() is slower.

Here is the full code if you want to play around.

Here is the code that shows you how to move the buffer around.

// Assumes an open std::ifstream ifs, a char buffer[] of szb+1 bytes (the spare
// byte lets the final line be null-terminated), and counters lines and sln.
int nr = 0;        // number of bytes kept from an incomplete line
bool last = false; // true once the final (short) read has happened
while (!last)
{
    // read after the carried-over bytes; a short read means end of file
    last = !ifs.read(buffer + nr, szb - nr);
    nr = nr + ifs.gcount();
    char *s = buffer, *p = buffer, *pe = buffer + nr;
    while (p != pe) {                 // process complete lines in the buffer
        for (s = p; p != pe && *p != '\n'; p++)
            ;
        if (p != pe) {                // found a newline: s is a complete line
            *p++ = '\0';
            lines++;                  // TO DO: here s is a null-terminated line to process
            sln += strlen(s);         // (dummy operation for the example)
            s = p;                    // nothing to carry over from this line
        } else if (last && p != s) {  // final line of the file has no trailing newline
            *p = '\0';                // uses the spare byte of the buffer
            lines++;
            sln += strlen(s);
            s = p;
        }
    }
    std::copy(s, pe, buffer);  // copy the last (incomplete) line to the front of the buffer
    nr = pe - s;               // and prepare the count for the next iteration
    // note: a single line longer than szb is not handled by this sketch
}