Q: Is there any benefit of storing the length of a large array within the array itself?
Explanation:
Let's say we compress some large binary serialized object using the GZipStream class from the System.IO.Compression namespace. The output is a Base64 string of the compressed byte array. At some later point the Base64 string gets converted back to a byte array and the data needs to be decompressed.
While compressing the data we create a new byte array with the size of the compressed byte array + 4. In the first 4 bytes we store the length of the original (uncompressed) data, and we then BlockCopy the length and the compressed data into the new array. This new array gets converted into a Base64 string.
While decompressing we convert the Base64 string back into a byte array. Now we can extract the stored length by using the BitConverter class, which reads an Int32 from the first 4 bytes. We then allocate a byte array of that length and let the stream write the decompressed bytes into it.
I can't imagine that something like this actually has any benefit at all. It adds more complexity to the code, more operations need to be executed, and readability is reduced too. The BlockCopy operations alone should consume so many resources that this just cannot have a benefit, right?
Compression example code:
byte[] buffer = new byte[0xffff]; // Some large binary serialized object
// Compress in-memory.
using (var mem = new MemoryStream())
{
// The actual compression takes place here.
using (var zipStream = new GZipStream(mem, CompressionMode.Compress, true)) {
zipStream.Write(buffer, 0, buffer.Length);
}
// Store compressed byte data here.
var compressedData = new byte[mem.Length];
mem.Position = 0;
mem.Read(compressedData, 0, compressedData.Length);
/* Increase the size by 4 to accommodate an Int32 that
** will store the length of the original (uncompressed) data. */
var zipBuffer = new byte[compressedData.Length + 4];
// Store the length of the original data in the first 4 bytes.
Buffer.BlockCopy(BitConverter.GetBytes(buffer.Length), 0, zipBuffer, 0, 4);
// Store the compressedData array after the 4-byte length prefix.
Buffer.BlockCopy(compressedData, 0, zipBuffer, 4, compressedData.Length);
return Convert.ToBase64String(zipBuffer);
}
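For comparison, here is a minimal sketch (not from the original post, assuming the same buffer) that writes the 4-byte length prefix straight into the MemoryStream before compressing. It produces the same layout while avoiding the intermediate compressedData array, the unchecked mem.Read call and both BlockCopy calls:

byte[] buffer = new byte[0xffff]; // Some large binary serialized object
using (var mem = new MemoryStream())
{
    // Write the 4-byte prefix (length of the original data) first.
    mem.Write(BitConverter.GetBytes(buffer.Length), 0, 4);
    // GZipStream appends the compressed data after the prefix.
    using (var zipStream = new GZipStream(mem, CompressionMode.Compress, true))
    {
        zipStream.Write(buffer, 0, buffer.Length);
    }
    // ToArray copies everything written so far, prefix included.
    return Convert.ToBase64String(mem.ToArray());
}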
Decompression example code:
byte[] zipBuffer = Convert.FromBase64String("some base64 string");
using (var inStream = new MemoryStream())
{
// The length of the original (uncompressed) data, stored in the first 4 bytes.
int dataLength = BitConverter.ToInt32(zipBuffer, 0);
// Allocate the output array with that exact size.
byte[] buffer = new byte[dataLength];
// Write the compressed data (skipping the 4-byte length prefix) into the stream.
inStream.Write(zipBuffer, 4, zipBuffer.Length - 4);
inStream.Position = 0;
// Decompress data.
using (var zipStream = new GZipStream(inStream, CompressionMode.Decompress)) {
zipStream.Read(buffer, 0, buffer.Length);
}
... code
... code
... code
}
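For contrast, a minimal sketch (not from the original post) of decompressing without a stored length, assuming the Base64 payload contains only the gzip data and no 4-byte prefix. Since the final size is not known up front, the output has to be buffered in a growing MemoryStream and copied out at the end:

byte[] compressed = Convert.FromBase64String("some base64 string");
using (var inStream = new MemoryStream(compressed))
using (var zipStream = new GZipStream(inStream, CompressionMode.Decompress))
using (var outStream = new MemoryStream())
{
    // CopyTo reads until the end of the gzip stream, letting outStream grow as needed.
    zipStream.CopyTo(outStream);
    // One extra full copy to end up with a right-sized array.
    byte[] buffer = outStream.ToArray();
    // ... code
}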
"I can't imagine that something like this actually has any benefit at all."
It can benefit when streams of data are involved. You read the start of the stream - it tells you how many bytes to read until the end of this "block" of data (e.g. a variable-length string), so you know when this "block" of data ends. I believe protobuf works like this, for example, when dealing with Length-delimited fields. – mjwills
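As an illustration of that pattern, a minimal sketch (hypothetical helpers, not from the comment) of length-prefixed framing on a stream with BinaryWriter/BinaryReader:

// Writes a 4-byte length prefix followed by the payload.
static void WriteBlock(BinaryWriter writer, byte[] block)
{
    writer.Write(block.Length);
    writer.Write(block);
}

// Reads the prefix first, so the reader knows exactly where this block ends.
static byte[] ReadBlock(BinaryReader reader)
{
    int length = reader.ReadInt32();
    // ReadBytes loops internally until 'length' bytes are read or the stream ends.
    return reader.ReadBytes(length);
}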
mem.Read(compressedData, 0, compressedData.Length);
This line of code worries me. It looks like you assume that if you ask it to read a certain number of bytes then it will do so. That is a dangerous (read: foolish) assumption when dealing with streams. You really should check the return value of that call. – mjwills
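For illustration, a minimal sketch (not from the comment) of a read loop that checks the return value, using the zipStream and buffer from the decompression example above:

int offset = 0;
while (offset < buffer.Length)
{
    int read = zipStream.Read(buffer, offset, buffer.Length - offset);
    if (read == 0)
        break; // end of stream reached before the buffer was filled
    offset += read;
}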