
I have a file that's too big to be read all at once. It must be processed in chunks.

I'm envisioning a background loading system that works on two buffers: one for processing and one for reading into, with the two swapped each time the processing buffer has been consumed.

In pseudocode:

Read Buf1
Mark Buf2 as dirty, so the background task will fill it with new data
Loop:
    Process Buf1
    if (reached end of Buf1)
        Block while Buf2 is still marked as dirty
        Swap Buf1 <-> Buf2
        Mark Buf2 as dirty
... and so on, until the file is exhausted.
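
For reference, the scheme above can be sketched in C++ with one `std::async` call per chunk. This is only a minimal sketch under stated assumptions, not a tuned implementation: `read_chunk` and `process_file_double_buffered` are hypothetical names, the `process` callback stands in for the real per-chunk work, and `std::launch::async` is passed explicitly so the read really runs concurrently instead of being deferred. The future returned by `std::async` plays the role of the "dirty" flag: `get()` blocks until the background read has finished.

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads up to buf.size() bytes into buf and
// returns the number of bytes actually read (0 at end of file).
std::size_t read_chunk(std::ifstream& in, std::vector<char>& buf) {
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    return static_cast<std::size_t>(in.gcount());
}

// Hypothetical helper: calls process(data, size) once per chunk while the
// next chunk is being read in the background. Returns the total number of
// bytes processed, or 0 if the file could not be opened.
std::size_t process_file_double_buffered(
    const std::string& path, std::size_t chunk_size,
    const std::function<void(const char*, std::size_t)>& process) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return 0;

    std::vector<char> front(chunk_size);  // buffer being processed ("Buf1")
    std::vector<char> back(chunk_size);   // buffer being filled ("Buf2")
    std::size_t total = 0;

    // The first chunk is read synchronously; there is nothing to overlap yet.
    std::size_t front_size = read_chunk(in, front);
    while (front_size > 0) {
        // "Mark Buf2 as dirty": start filling it in the background.
        auto pending = std::async(std::launch::async, read_chunk,
                                  std::ref(in), std::ref(back));

        process(front.data(), front_size);
        total += front_size;

        // "Block if Buf2 still marked as dirty", then swap the buffers.
        const std::size_t back_size = pending.get();
        std::swap(front, back);
        front_size = back_size;
    }
    return total;
}
```

Only the background task touches the stream and `back` while a read is pending, so no extra locking is needed in this sketch. Whether each `std::async` call creates a fresh thread or reuses a pooled one depends on the standard library implementation.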

There are going to be many chunks. Would it be better to dedicate a reading thread to this, or is it "ok" to launch a std::async call for every read operation? I'm told that these calls launch their own threads internally, which is expensive.

Yes, it is time critical.

1
Expensive is always relative. If the chunks you read are fairly large, and given that IO is the bottleneck there, the cost of creating a new thread for each chunk could be negligible compared to the other costs. And depending on how the stdlib you use implements std::async, the thread used for it might be reused from a thread pool. If you use std::thread you have full control, but you would need to pause your thread and wait for new data, and if you do that the wrong way you might get worse performance than spawning a new thread for each chunk. – t.niese
Every sane std::async implementation uses a thread pool, so it's not bad. It's a good start. But later on I would consider making a global IO worker pool to have control over how many threads are spawned (possibly one pool per physical drive). It's not unusual to have more IO worker threads than total vcores on the machine, and std::async would never do that. – Sopel
Yeah, MSVC apparently does. So I guess there's that. – Thomas B.
If you only have "a" file, either a thread or async will be blocked on IO. I'd suggest the producer-consumer pattern. It's simpler than swapping thread control between parsing and loading. – Louis Go

1 Answer


You should probably switch to using a memory mapped file if possible. That should solve your issues more or less, having the OS do the 'heavy lifting' for you while also saving on data copies.

Wikipedia page on the subject.

has support for memory mapped files. That one is somewhat easier to use than the different platform specific solutions I know of.