6 votes

I am trying to download a large file (>1GB) from one server to another over HTTP. To do this I am making HTTP range requests in parallel, so that different parts of the file download concurrently.

When saving to disk I am taking each response stream, opening the same file as a file stream, seeking to the range I want and then writing.

However, I find that all but one of my response streams time out. It looks like the disk I/O cannot keep up with the network I/O. However, if I do the same thing but have each thread write to a separate file, it works fine.

For reference, here is my code writing to the same file:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
Exception exception = null;   // set when any download fails
//List populated here
Parallel.For(0, numberOfStreams, (index, state) =>
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
        using(Stream responseStream = webRequest.GetResponse().GetResponseStream())
        {
            using (FileStream fileStream = File.Open(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write))
            {
                fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);
                byte[] buffer = new byte[64 * 1024];
                int bytesRead;
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    if (state.IsStopped)
                    {
                        return;
                    }
                    fileStream.Write(buffer, 0, bytesRead);
                }
            }
        }
    }
    catch (Exception e)
    {
        exception = e;
        state.Stop();
    }
});
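
(The actual range header is elided in these snippets; with HttpWebRequest it would be set with something like the following, hypothetical line before GetResponse is called:)

// Hypothetical: request only this thread's byte range (inclusive offsets).
webRequest.AddRange(ranges[index].Item1, ranges[index].Item2);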

And here is the code writing to multiple files:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
Exception exception = null;   // set when any download fails
//List populated here
Parallel.For(0, numberOfStreams, (index, state) =>
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
        using(Stream responseStream = webRequest.GetResponse().GetResponseStream())
        {
            using (FileStream fileStream = File.Open(fileName + "." + index + ".tmp", FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write))
            {
                fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);
                byte[] buffer = new byte[64 * 1024];
                int bytesRead;
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    if (state.IsStopped)
                    {
                        return;
                    }
                    fileStream.Write(buffer, 0, bytesRead);
                }
            }
        }
    }
    catch (Exception e)
    {
        exception = e;
        state.Stop();
    }
});

My question is this: are there additional checks or actions that C#/Windows performs when writing to a single file from multiple threads that would make the file I/O slower than writing to multiple files? All disk operations should be bound by disk speed, right? Can anyone explain this behavior?

Thanks in advance!

UPDATE: Here is the error the source server is throwing:

"Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." [System.IO.IOException]: "Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." InnerException: "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" Message: "Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." StackTrace: " at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)\r\n at System.Net.Security._SslStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)\r\n at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)\r\n at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)\r\n

The only thing I can see that might cause the writing to a single file to become or appear sluggish is that you are not flushing the file after each call to fileStream.Write(buffer, 0, bytesRead); - MethodMan
Don't open the same file in each thread. Open it once and use that single instance (make sure more than one thread doesn't write at the same time - you can use lock for it) - EZI
This should work. Post the exception ToString. How fast is the network and how big is your timeout? (Note that Parallel.For is unsuitable because it uses an uncontrollable degree of parallelism. You can only specify a maximum.) - usr
@usr the server I am getting the file from is what throws the exception. It states 'SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'. I'm taking this to mean that the receiver is not reading bytes from the stream fast enough. - shortspider
@shortspider It generally means you tried to connect to a non-existing machine. (If it existed, you would get either a successful connection or a kind of connection-rejected message in a very short time) - EZI

5 Answers

4 votes

Unless you're writing to a striped RAID, you're unlikely to see any performance benefit from writing to the file from multiple threads concurrently. In fact, it's more likely to be the opposite: the concurrent writes get interleaved and cause random access, incurring disk seek latencies that make them orders of magnitude slower than large sequential writes.

To get a sense of perspective, look at some typical latency numbers: a sequential 1 MB read from disk takes about 20 ms, and writes take approximately the same time. Each disk seek, on the other hand, takes around 10 ms. If your writes are interleaved in 4 KB chunks, a 1 MB write turns into 256 chunks, each potentially preceded by a seek: an extra 256 × 10 ms = 2560 ms of seek time, making it over 100 times slower than the sequential write.

I would suggest only allowing one thread to write to the file at any time, using the parallelism just for the network transfer. You can use a producer–consumer pattern in which downloaded chunks are added to a bounded concurrent collection (such as BlockingCollection<T>) and then picked up and written to disk by a single dedicated thread.
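
A minimal sketch of that pattern (the Chunk type, the queue capacity, and the surrounding method are illustrative assumptions, not code from the question):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// A downloaded chunk: where it belongs in the file, plus its bytes.
class Chunk
{
    public long Offset;
    public byte[] Data;
}

class SingleWriterDownload
{
    public static void Run(string fileName)
    {
        // Bounded queue: downloaders block once 64 chunks are queued,
        // throttling the network side to what the disk can absorb.
        var chunks = new BlockingCollection<Chunk>(boundedCapacity: 64);

        // One dedicated writer: no two threads ever interleave writes.
        var writer = Task.Run(() =>
        {
            using (var fs = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.None))
            {
                foreach (var chunk in chunks.GetConsumingEnumerable())
                {
                    fs.Seek(chunk.Offset, SeekOrigin.Begin);
                    fs.Write(chunk.Data, 0, chunk.Data.Length);
                }
            }
        });

        // ... download threads call chunks.Add(new Chunk { Offset = ..., Data = ... }) ...

        chunks.CompleteAdding(); // signal the writer once all downloads are done
        writer.Wait();
    }
}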

2 votes

    fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);

That Seek() call is the problem: you'll seek to a part of the file that's very far beyond the current end-of-file. Your next fileStream.Write() call then forces the file system to extend the file on disk, filling the unwritten parts of it with zeros.

This can take a while, and your thread will be blocked until the file system is done extending the file. That might well be long enough to trigger a timeout, and you'd see it go wrong right at the start of the transfer.

A workaround is to create and fill the entire file before you start writing real data. This is a very common strategy among downloaders; you might have seen .part files before. Another nice benefit is that you get a decent guarantee that the transfer cannot fail because the disk ran out of space. Beware that filling a file with zeros is only cheap when the machine has enough RAM; 1 GB should not be a problem on modern machines.
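
A minimal sketch of that pre-fill step (the fileSize parameter is an assumption, presumably taken from the Content-Length header):

using System;
using System.IO;

static class Prefill
{
    // Write zeros sequentially once, up front, so the parallel writers
    // never force the file system to extend the file mid-download.
    public static void Fill(string fileName, long fileSize)
    {
        using (var fs = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            var zeros = new byte[64 * 1024];
            long remaining = fileSize;
            while (remaining > 0)
            {
                int count = (int)Math.Min(zeros.Length, remaining);
                fs.Write(zeros, 0, count);   // sequential writes, no seeks
                remaining -= count;
            }
        }
    }
}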

Repro code:

using System;
using System.IO;
using System.Diagnostics;

class Program {
    static void Main(string[] args) {
        string path = @"c:\temp\test.bin";
        var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Write);
        fs.Seek(1024L * 1024 * 1024, SeekOrigin.Begin);
        var buf = new byte[4096];
        var sw = Stopwatch.StartNew();
        fs.Write(buf, 0, buf.Length);
        sw.Stop();
        Console.WriteLine("Writing 4096 bytes took {0} milliseconds", sw.ElapsedMilliseconds);
        Console.ReadKey();
        fs.Close();
        File.Delete(path);
    }
}

Output:

Writing 4096 bytes took 1491 milliseconds

That was on a fast SSD; a spindle drive is going to take much longer.

1 vote

Here's my guess from the information given so far:

On Windows, when you write to a position that extends the file size, Windows needs to zero-initialize everything that comes before it. This prevents old disk data from leaking, which would be a security problem.

Probably all but your first thread need to zero-init so much data that the download times out. This is not really streaming anymore, because the first write takes ages.

If you have the SE_MANAGE_VOLUME privilege ("Perform volume maintenance tasks") you can avoid the zero initialization by calling SetFileValidData; otherwise you cannot, for security reasons. Free Download Manager, for example, shows a message that it starts zero-initing at the start of each download.
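
A minimal sketch of that fast path, assuming the privilege is held; beware that it can expose stale disk contents in any region the download never overwrites:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class FastAllocate
{
    // Succeeds only when the process holds SE_MANAGE_VOLUME_NAME
    // ("Perform volume maintenance tasks").
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool SetFileValidData(SafeFileHandle hFile, long validDataLength);

    public static void Preallocate(string path, long length)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            fs.SetLength(length); // reserves the space; zero-fill is still deferred
            if (!SetFileValidData(fs.SafeFileHandle, length))
                // Typically ERROR_PRIVILEGE_NOT_HELD when the privilege is missing.
                throw new IOException("SetFileValidData failed", Marshal.GetLastWin32Error());
        }
    }
}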

1 vote

So after trying all the suggestions, I ended up using a MemoryMappedFile and opening a stream to write to the MemoryMappedFile on each thread:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
long? fileSize = null;        // total size, set from the Content-Length header
Exception exception = null;   // set when any download fails
//Ranges list populated here
using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.OpenOrCreate, null, fileSize.Value, MemoryMappedFileAccess.ReadWrite))
{
    Parallel.For(0, numberOfStreams, index =>
    {
        try
        {
            HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
            using(Stream responseStream = webRequest.GetResponse().GetResponseStream())
            {
                using (MemoryMappedViewStream fileStream = mmf.CreateViewStream(ranges[index].Item1, ranges[index].Item2 - ranges[index].Item1 + 1, MemoryMappedFileAccess.Write))
                {
                    responseStream.CopyTo(fileStream);
                }
            }
        }
        catch (Exception e)
        {
            exception = e;
        }
    });
}
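
(This presumably works because creating the mapping extends the file to its full size once, up front, so no worker's first write has to wait for the file system to zero-fill everything before its offset.)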
0 votes

System.Net.Sockets.NetworkStream.Write

The stack trace shows that the error happens when writing to the server. It is a timeout. This can happen because of

  1. network failure/overloading
  2. an unresponsive server.

This is not an issue with writing to a file. Analyze the network and the server. Maybe the server is not ready for concurrent usage.

Prove this theory by disabling writing to the file. The error should remain.
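
A quick way to run that check is to drain each response stream without touching the disk, replacing only the inner loop of the question's code with something like:

// Read and discard the data instead of writing it to a FileStream.
// If the timeout still occurs, the bottleneck is the network or the
// server, not the file I/O.
byte[] buffer = new byte[64 * 1024];
while (responseStream.Read(buffer, 0, buffer.Length) > 0)
{
    // intentionally discard buffer contents
}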