
I am building an application that needs to allow users to upload large images (up to about 100 MB) to the Windows Azure Blob Storage service. Having read Rob Gillen's excellent article on file upload optimization for Windows Azure, I borrowed his approach for uploading file chunks in parallel, using the CloudBlockBlob.PutBlock() method within a Parallel.For loop (code is available here).

The problem I have is that whenever I try to upload a file I get an "InvalidMd5" exception from the storage client. Suspecting that the problem may be in the development storage, I also tried running the code against my live Azure storage account, but I got the same error. Looking at the traffic with Fiddler I see that the "Content-MD5" header is set to a valid MD5 hash. The description of the error says that "The MD5 value specified in the request is invalid. The MD5 value must be 128 bits and Base64-encoded.", but to the best of my knowledge the value I see being sent in Fiddler is valid (e.g. a91c588092cedbdb1b82c2d3786fd509).
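
For reference, this is the header exactly as it appears in my Fiddler capture (using the example value above):

Content-MD5: a91c588092cedbdb1b82c2d3786fd509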

Here is the code I use for calculating the hash (courtesy of Rob Gillen):

public static string GetMD5HashFromStream(byte[] data)
{
    MD5 md5 = new MD5CryptoServiceProvider();
    byte[] retVal = md5.ComputeHash(data);

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < retVal.Length; i++)
    {
        sb.Append(retVal[i].ToString("x2"));
    }
    return sb.ToString();
}

And this is the actual call to PutBlock():

blob.PutBlock(transferDetails[j].BlockId, new MemoryStream(buff), blockHash, options);

I also tried passing the hash like so:

Convert.ToBase64String(Encoding.UTF8.GetBytes(blockHash))

but the result was the same - "InvalidMd5" error :(

Whether the hash passed to PutBlock() is Base64-encoded (e.g. YTkxYzU4ODA5MmNlZGJkYjFiODJjMmQzNzg2ZmQ1MDk=) or not (e.g. a91c588092cedbdb1b82c2d3786fd509) doesn't seem to make a difference.
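
To be concrete, these are the two variants I tried (buff and Helpers.GetMD5HashFromStream() are the same as in the code above; the commented values are the ones from my Fiddler captures):

// variant 1: the 32-character hex string produced by the helper
string blockHash = Helpers.GetMD5HashFromStream(buff);
// e.g. "a91c588092cedbdb1b82c2d3786fd509"

// variant 2: that same hex string, UTF-8 encoded and then Base64-encoded
string base64Hash = Convert.ToBase64String(Encoding.UTF8.GetBytes(blockHash));
// e.g. "YTkxYzU4ODA5MmNlZGJkYjFiODJjMmQzNzg2ZmQ1MDk="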

Rob's code obviously worked for him and I really have no idea what may be causing the problem in my case. The only change I've made to Rob's code is to alter the ParallelUpload() extension method to take a Stream instead of a file name and to dynamically determine the block size depending on the size of the file being uploaded.

Please, if anyone has an idea how to solve this problem, let me know - I will be really grateful! I've already lost two days struggling with this.


2 Answers


Rob, thank you for offering to help and pointing out the difference in the MD5 hashes. Your answer got me thinking in the right direction. I spent another whole day digging into this but luckily (and thanks to your remark :)) I finally managed to resolve the problem. It turned out there were actually two issues in my case:

1) The MD5 hash: I noticed that the hash you pasted in your answer was shorter than the one I was getting, but it took me a while to see that yours was exactly half the length of mine. After some experimentation I found out that the GetMD5HashFromStream() method from your test application converts the 16-byte hash generated by the MD5CryptoServiceProvider into a 32-character hex string. It was this 32-character string that was causing the problem: it was being Base64-encoded and passed to the PutBlock() method, producing a hash twice as long as expected, which is what the blob storage service was complaining about. Here is the code I ended up with:

Original:

public static string GetMD5HashFromStream(byte[] data)
{
    MD5 md5 = new MD5CryptoServiceProvider();
    byte[] retVal = md5.ComputeHash(data);

    // Converts the 16-byte hash into a 32-character hex string;
    // it is this hex string that later breaks the Content-MD5 header
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < retVal.Length; i++)
    {
        sb.Append(retVal[i].ToString("x2"));
    }
    return sb.ToString();
}

and the call to PutBlock():

// calculate the block-level hash
string blockHash = Helpers.GetMD5HashFromStream(buff);
blob.PutBlock(transferDetails[j].BlockId, new MemoryStream(buff), blockHash, options);

Final:

// Base64-encode the raw 16-byte hash directly, without the hex detour
MD5 md5 = new MD5CryptoServiceProvider();
byte[] blockHash = md5.ComputeHash(buff);
string convertedHash = Convert.ToBase64String(blockHash, 0, 16);
blob.PutBlock(transferDetails[j].BlockId, new MemoryStream(buff), convertedHash, options);
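
As a quick sanity check (just a throwaway sketch of mine, assuming buff holds the block data as above), the Base64 form of a raw 16-byte MD5 hash is always 24 characters long and ends with "==" - exactly the shape of the hash Rob captured in Fiddler (see his answer below):

MD5 md5 = new MD5CryptoServiceProvider();
string encoded = Convert.ToBase64String(md5.ComputeHash(buff));
// 16 raw bytes -> 24 Base64 characters, padded with "=="
System.Diagnostics.Debug.Assert(encoded.Length == 24 && encoded.EndsWith("=="));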

Rob, I'm really curious how your code worked in your case and why it didn't in mine - is it something specific to the setup on my machine, or perhaps a different version of the Azure tools (I'm using v1.2)? Please let me know if you have any idea.

2) A bug in the development storage: a lot of combing through the web led me to this page, which mentions an obscure but apparently known bug in the development storage:

If two requests attempt to upload a block to a blob that does not yet exist in development storage, one request will create the blob, and the other may return status code 409 (Conflict), with storage services error code BlobAlreadyExists.

Here is what I came up with to work around it:

public static bool IsDevelopmentStorageRunning()
{
    // DevStore is part of the Azure SDK tooling; IsRunning() reports
    // whether the local development storage service is up
    return new Microsoft.ServiceHosting.Tools.DevelopmentStorage.DevStore().IsRunning();
}

You will need to add a reference to Microsoft.ServiceHosting.Tools.dll, which was located in "C:\Program Files\Windows Azure SDK\v1.2\bin" on my machine. Then, I use this method before the Parallel.For loop that processes the file chunks as follows:

// Fall back to sequential uploads against development storage to avoid
// the concurrent PutBlock() bug quoted above; stay parallel in the cloud
bool isDevStorageRunning = StorageProxy.IsDevelopmentStorageRunning();
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = isDevStorageRunning ? 1 : 4;
Parallel.For(0, transferDetails.Length, parallelOptions, j => { ... });

I hope this will save someone all the hassles I went through. Rob, thank you once again for helping out :)


tishon,

After seeing this post, I went back and re-tested my code, and I'm thinking that there is a problem with the data being passed (possibly what you are passing into the function?).

One thing that jumped out at me immediately was the MD5 hash you provided... in every case I've tested, my MD5 hashes end with two equals signs, like the following (captured from Fiddler):

Content-MD5: D1Mxthoqhlwm9cC0729mWA==

I'm not a crypto expert, but I know from working with the block IDs for block blobs that if you have invalid/unsafe characters in your blob ID prior to converting it to a Base64-encoded value, you'll get invalid data and block IDs that Azure can't interpret.
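
For what it's worth, here is a minimal sketch of how I generate safe block IDs (the "block-" prefix and the zero-padded index are just my own convention, and blockIndex is whatever loop counter you're using):

// Block IDs must be Base64-encoded; within a single blob, every block ID
// must also have the same length, hence the fixed-width index
string blockId = Convert.ToBase64String(
    Encoding.UTF8.GetBytes(string.Format("block-{0:D6}", blockIndex)));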