
I'm using this port of the Mozilla character set detector to determine a file's encoding and then using that to construct a StreamReader. So far, so good.

However, the file format I am reading is an odd one and from time to time it is necessary to skip a number of bytes. That is, a file that is otherwise text, in one or other encoding, will have some raw bytes embedded in it.

I would like to read the stream as text, up to the point that I hit some text that indicates a byte stream follows, then I would like to read the byte stream, then resume reading as text. What is the best way of doing this (balance of simplicity and performance)?

I can't rely on seeking against the FileStream underlying the StreamReader (and then discarding the buffered data in the latter) because I don't know how many bytes were used in reading the characters up to that point. I might abandon StreamReader and switch to a bespoke class that uses parallel arrays of bytes and chars, populates the latter from the former using a decoder, and tracks the position in the byte array every time a character is read by using the encoding to calculate the number of bytes used for that character. Yuk.

To further clarify, the file has this format:

[encoded chars][embedded bytes indicator + len][len bytes][encoded chars]...

There may be zero, one, or many blocks of embedded bytes, and the blocks of encoded chars may be any length.

So, for example:

ABC:123:DEF:456:$0099[0x00,0x01,0x02,... x 99]GHI:789:JKL:...

There are no line delimiters. There may be any number of fields (ABC, 123, ...) delimited by some character (in this case a colon). These fields may be in various code pages, including UTF-8 (so a character is not guaranteed to be a single byte). When I hit a $, I know that the next 4 bytes contain a length (call it n), the next n bytes are to be read raw, and the byte after those n bytes starts another text field (GHI).

I have a hard time figuring out what the question is. Can you make that more clear, maybe add a question mark? - nvoigt

StreamReader is not a particularly complicated animal. You could implement your own version pretty easily. There's no need for any parallel buffers, just do what most readers do and keep a byte[] buffer, and transfer to/from it as appropriate. Not yucky at all. - glenebob

@glenebob - the problem is in 'transfer to it as appropriate'. As I mainly want to parse text, I need to convert from my byte buffer to a char buffer using either an Encoding or a Decoder instance for the detected encoding. If I take x bytes and convert them to chars, I will hit a problem if I then need to start reading a fixed number of bytes while in the middle of reading from the char buffer. - user2729292

@RonIdaho: Why don't you just read the bytes as bytes, turn them into strings using the appropriate encoding (e.g. UTF-8), and if you have to skip bytes, just skip bytes? - Stefan Steinegger

How do you know when you've reached the end of a given string of characters? What is the nature of the "embedded bytes indicator + len"? It sounds as though the reader may need to nibble byte by byte through the buffer, constructing a string value, until it reaches a terminator byte. That's precisely what StreamReader.ReadLine() does. It's very straightforward. - glenebob

1 Answer


Proof of concept. This class works with UTF-16 string data and ':' delimiters, per the OP. It expects the binary length as a 4-byte, little-endian integer. It should be easy to adjust to more specific details of your (odd) file format. For example, any Decoder class should drop in to ReadString() and "just work".
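To illustrate why feeding a Decoder one byte at a time keeps the byte position known, here is a minimal standalone sketch. ReadUntil and ByteTrackingDecode are hypothetical names and not part of OddFileReader. Unlike the class below, it compares the decoded character (not the raw byte) against the delimiter, and it uses a two-char output buffer on the assumption that some encodings can produce a surrogate pair from one completed byte sequence.

```csharp
using System;
using System.Text;

public static class ByteTrackingDecode
{
    // Decodes text from data, starting at start, one byte at a time, until
    // the stop character has been decoded or the data runs out.
    // bytesConsumed reports how many bytes were fed to the decoder,
    // including the bytes of the stop character itself.
    public static string ReadUntil(byte[] data, int start, Encoding encoding,
                                   char stop, out int bytesConsumed)
    {
        Decoder decoder = encoding.GetDecoder();
        char[] chars = new char[2]; // room for one surrogate pair
        StringBuilder text = new StringBuilder();
        int position = start;

        while (position < data.Length)
        {
            int bytesUsed;
            int charsUsed;
            bool completed;

            // Feed exactly one byte; the decoder keeps partial sequences
            // in its internal state until a whole character is available.
            decoder.Convert(data, position, 1, chars, 0, chars.Length, false,
                            out bytesUsed, out charsUsed, out completed);
            position += bytesUsed;

            if (charsUsed > 0 && chars[0] == stop)
            {
                break;
            }

            text.Append(chars, 0, charsUsed);
        }

        bytesConsumed = position - start;
        return text.ToString();
    }
}
```

Because only one byte is ever handed to the decoder, `start + bytesConsumed` is the offset of the first byte after the delimiter, which is where a raw read could begin.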

To use it, construct it with a Stream instance. For each individual data element, call ReportNextData(), which will tell you what kind of data is next, and then call the appropriate Read*() method. For binary data, call ReadBinaryLength() and then ReadBinaryData().

Note that ReadBinaryData() follows the stream contract; it is not guaranteed to return as many bytes as you asked for, so you may need to call it several times. However, if you ask for too many bytes, it will throw EndOfStreamException.
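A small helper can hide the repeated calls. This is only a sketch: ReadFully and BinaryReadHelper are hypothetical names, and the helper takes the read method as a delegate so that it compiles on its own and works with anything that follows the Stream.Read contract.

```csharp
using System;
using System.IO;

public static class BinaryReadHelper
{
    // Calls read repeatedly until exactly length bytes have been collected.
    // read must follow the Stream.Read contract: it returns the number of
    // bytes actually copied, which may be fewer than requested.
    public static byte[] ReadFully(Func<byte[], int, int, int> read, int length)
    {
        byte[] result = new byte[length];
        int total = 0;

        while (total < length)
        {
            int bytesRead = read(result, total, length - total);

            if (bytesRead == 0)
            {
                // Guard against a source that ends early, so the loop cannot spin forever.
                throw new EndOfStreamException();
            }

            total += bytesRead;
        }

        return result;
    }
}
```

With the class below it would be called as `BinaryReadHelper.ReadFully(reader.ReadBinaryData, reader.ReadBinaryLength())`, since each request never asks for more than the bytes that remain.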

I tested it with this data (hex format): 410042004300240A0000000102030405060708090024050000000504030201580059005A003A310032003300

Which is: ABC$[10][1234567890]$[5][54321]XYZ:123
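If you want to reproduce the test, one way to turn that hex string into a stream is sketched below. TestData, FromHex and Open are hypothetical names introduced only for this example.

```csharp
using System;
using System.IO;

public static class TestData
{
    // The hex dump quoted above, copied verbatim.
    public const string Hex =
        "410042004300240A0000000102030405060708090024050000000504030201580059005A003A310032003300";

    // Converts a string of two-digit hex pairs into the bytes they represent.
    public static byte[] FromHex(string hex)
    {
        byte[] bytes = new byte[hex.Length / 2];

        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }

        return bytes;
    }

    // Wraps the test bytes in a MemoryStream, ready to hand to a reader.
    public static Stream Open()
    {
        return new MemoryStream(FromHex(Hex));
    }
}
```

The reader can then be constructed with `new OddFileReader(TestData.Open())`.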

Scan the data like so:

OddFileReader.NextData nextData;

while ((nextData = reader.ReportNextData()) != OddFileReader.NextData.Eof)
{
    // Call appropriate Read*() here.
}

using System;
using System.IO;
using System.Text;

public class OddFileReader : IDisposable
{
    public enum NextData
    {
        Unknown,
        Eof,
        String,
        BinaryLength,
        BinaryData
    }

    private Stream source;
    private byte[] byteBuffer;
    private int bufferOffset;
    private int bufferEnd;
    private NextData nextData;
    private int binaryOffset;
    private int binaryEnd;
    private char[] characterBuffer;

    public OddFileReader(Stream source)
    {
        this.source = source;
    }

    public NextData ReportNextData()
    {
        if (nextData != NextData.Unknown)
        {
            return nextData;
        }

        if (!PopulateBufferIfNeeded(1))
        {
            return (nextData = NextData.Eof);
        }

        if (byteBuffer[bufferOffset] == '$')
        {
            return (nextData = NextData.BinaryLength);
        }
        else
        {
            return (nextData = NextData.String);
        }
    }

    public string ReadString()
    {
        ReportNextData();

        if (nextData == NextData.Eof)
        {
            throw new EndOfStreamException();
        }
        else if (nextData != NextData.String)
        {
            throw new InvalidOperationException("Attempt to read non-string data as string");
        }

        if (characterBuffer == null)
        {
            characterBuffer = new char[1];
        }

        StringBuilder stringBuilder = new StringBuilder();
        Decoder decoder = Encoding.Unicode.GetDecoder();

        while (nextData == NextData.String)
        {
            byte b = byteBuffer[bufferOffset];

            if (b == '$')
            {
                nextData = NextData.BinaryLength;

                break;
            }
            else if (b == ':')
            {
                nextData = NextData.Unknown;
                bufferOffset++;

                break;
            }
            else
            {
                if (decoder.GetChars(byteBuffer, bufferOffset++, 1, characterBuffer, 0) == 1)
                {
                    stringBuilder.Append(characterBuffer[0]);
                }

                if (bufferOffset == bufferEnd && !PopulateBufferIfNeeded(1))
                {
                    nextData = NextData.Eof;

                    break;
                }
            }
        }

        return stringBuilder.ToString();
    }

    public int ReadBinaryLength()
    {
        ReportNextData();

        if (nextData == NextData.Eof)
        {
            throw new EndOfStreamException();
        }
        else if (nextData != NextData.BinaryLength)
        {
            throw new InvalidOperationException("Attempt to read non-binary-length data as binary length");
        }

        bufferOffset++;

        if (!PopulateBufferIfNeeded(sizeof(Int32)))
        {
            nextData = NextData.Eof;

            throw new EndOfStreamException();
        }

        binaryEnd = BitConverter.ToInt32(byteBuffer, bufferOffset);
        binaryOffset = 0;
        bufferOffset += sizeof(Int32);
        nextData = NextData.BinaryData;

        return binaryEnd;
    }

    public int ReadBinaryData(byte[] buffer, int offset, int count)
    {
        ReportNextData();

        if (nextData == NextData.Eof)
        {
            throw new EndOfStreamException();
        }
        else if (nextData != NextData.BinaryData)
        {
            throw new InvalidOperationException("Attempt to read non-binary data as binary data");
        }

        if (count > binaryEnd - binaryOffset)
        {
            throw new EndOfStreamException();
        }

        int bytesRead;

        if (bufferOffset < bufferEnd)
        {
            bytesRead = Math.Min(count, bufferEnd - bufferOffset);

            Array.Copy(byteBuffer, bufferOffset, buffer, offset, bytesRead);
            bufferOffset += bytesRead;
        }
        else if (count < byteBuffer.Length)
        {
            if (!PopulateBufferIfNeeded(1))
            {
                throw new EndOfStreamException();
            }

            bytesRead = Math.Min(count, bufferEnd - bufferOffset);

            Array.Copy(byteBuffer, bufferOffset, buffer, offset, bytesRead);
            bufferOffset += bytesRead;
        }
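        else
        {
            // Large request with an empty buffer: read straight from the stream.
            bytesRead = source.Read(buffer, offset, count);

            if (bytesRead == 0)
            {
                // The stream ended before the announced binary length was reached.
                throw new EndOfStreamException();
            }
        }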

        binaryOffset += bytesRead;

        if (binaryOffset == binaryEnd)
        {
            nextData = NextData.Unknown;
        }

        return bytesRead;
    }

    private bool PopulateBufferIfNeeded(int minimumBytes)
    {
        if (byteBuffer == null)
        {
            byteBuffer = new byte[8192];
        }

        if (bufferEnd - bufferOffset < minimumBytes)
        {
            int shiftCount = bufferEnd - bufferOffset;

            if (shiftCount > 0)
            {
                Array.Copy(byteBuffer, bufferOffset, byteBuffer, 0, shiftCount);
            }

            bufferOffset = 0;
            bufferEnd = shiftCount;

            while (bufferEnd - bufferOffset < minimumBytes)
            {
                int bytesRead = source.Read(byteBuffer, bufferEnd, byteBuffer.Length - bufferEnd);

                if (bytesRead == 0)
                {
                    return false;
                }

                bufferEnd += bytesRead;
            }
        }

        return true;
    }

    public void Dispose()
    {
        Stream source = this.source;

        this.source = null;

        if (source != null)
        {
            source.Dispose();
        }
    }
}