Overwrite contents of an existing blob in Azure Blob Storage


I am using block blobs to append time-series data to Azure Blob Storage with the Azure Storage client. I now want to update the contents of an existing blob. The file can be as large as 800 MB.

Is there any way to download blob in chunks based on blockId, change the contents and upload the contents of that blockId?

azure azure-blob-storage

2 Answers

votes

Is there any way to download blob in chunks based on blockId, change the contents and upload the contents of that blockId?

AFAIK, this is not currently possible with the existing APIs. The current API only gives you each block's ID and size. For this to work, you would need to store each block's metadata (block ID, starting/ending byte range) somewhere.

One possible solution (just thinking out loud) would be to use the blob's metadata to store this per-block metadata. You can read the metadata, get the byte range to download, download that data, modify it, and then upload it back. When uploading, you will also need to adjust the stored block metadata, since a modified block may change size. Keep in mind that there is a limit on metadata size (8 KB).
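
The byte-range bookkeeping described above can be sketched roughly as follows. This is only an illustration: BlockRange and BlockRanges.Compute are hypothetical names, not part of any Azure library, and the sketch assumes you already have each block's ID and size listed in the order the blocks appear in the blob.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of where one block sits inside the blob.
public readonly struct BlockRange
{
    public BlockRange(string blockId, long start, long length)
    {
        BlockId = blockId;
        Start = start;
        Length = length;
    }

    public string BlockId { get; }
    public long Start { get; }               // offset of the block's first byte
    public long Length { get; }              // block size in bytes
    public long End => Start + Length - 1;   // inclusive offset of the block's last byte
}

public static class BlockRanges
{
    // Assumption: `blocks` yields (id, size) pairs in blob order,
    // so each block starts where the previous one ended.
    public static List<BlockRange> Compute(IEnumerable<(string Id, long Size)> blocks)
    {
        var ranges = new List<BlockRange>();
        long offset = 0;
        foreach (var (id, size) in blocks)
        {
            ranges.Add(new BlockRange(id, offset, size));
            offset += size; // running total of all earlier block sizes
        }
        return ranges;
    }
}
```

The Start and Length of an entry are what a ranged download would need as its offset and length. After a block is rewritten with a different size, the ranges of every later block shift, so the table has to be recomputed rather than patched in place.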

votes

You can do this with the .NET libraries Microsoft.WindowsAzure.Storage and Microsoft.WindowsAzure.Storage.Blob.

Say you wanted to remove the header row from a very large CSV file, but only download and upload the first block:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

namespace RemoveHeaderRow
{
    class Program
    {
        static async Task Main(string[] args)
        {
            var storageAccount = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net");
            var client = storageAccount.CreateCloudBlobClient();
            var container = client.GetContainerReference("containerName");
            var blockBlob = container.GetBlockBlobReference("blobName.csv");

            var blockList = await blockBlob.DownloadBlockListAsync();
            if (!blockList.Any())
            {
                // not all blobs have a block list, here's why: https://stackguides.com/questions/14652172/azure-blobs-block-list-is-empty-but-blob-is-not-empty-how-can-this-be
                return; // cannot proceed
            }
            var firstBlock = blockList.First();

            //  download block
            var contents = await GetBlockBlobContents(blockBlob, firstBlock);

            //  remove first line
            var noHeaderContents = string.Join("\n", contents.Split('\n').Skip(1));

            //  upload block back to azure
            await UpdateBlockBlobContent(blockBlob, firstBlock, noHeaderContents);

            //  commit the blocks, all blocks need to be committed, not just the updated one
            await blockBlob.PutBlockListAsync(blockList.Select(b => b.Name));
        }

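        // Downloads the bytes of the given block and decodes them as text.
        // The range below starts at offset 0, so this is only correct for the first block in the blob;
        // any other block starts at the sum of the lengths of the blocks before it.
        public static async Task<string> GetBlockBlobContents(CloudBlockBlob blockBlob, ListBlockItem blockItem)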
        {
            using (var memStream = new MemoryStream())
            using (var streamReader = new StreamReader(memStream))
            {
                await blockBlob.DownloadRangeToStreamAsync(memStream, 0, blockItem.Length);
                memStream.Position = 0;
                return await streamReader.ReadToEndAsync();
            }
        }

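        // Uploads the new contents as an uncommitted block that reuses the existing block's ID.
        // The change only takes effect once the block list is committed with PutBlockListAsync.
        public static async Task UpdateBlockBlobContent(CloudBlockBlob blockBlob, ListBlockItem blockItem, string contents)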
        {
            using (var stream = new MemoryStream())
            using (var writer = new StreamWriter(stream))
            {
                writer.Write(contents);
                writer.Flush();
                stream.Position = 0;
                await blockBlob.PutBlockAsync(blockItem.Name, stream, null);
            }
        }
    }
}