How does one read binary and text from the same file in Python? I know how to do each separately, and can imagine doing both very carefully, but not both with the built-in IO library directly.

I have a file format in which large chunks of UTF-8 text are interspersed with binary data. The text has neither a length written before it nor a special character such as "\0" separating it from the binary data. Instead, a large portion of text near the end of each chunk, once parsed, means "we are coming to an end".

The optimal solution would be to have the built-in file reading classes have "read(n)" and "read_char(n)" methods, but alas they don't. I can't even open the file twice, once as text and once as binary, since the return value of tell() on the text one can't be used with the binary one in any meaningful way.

So my first idea would be to open the whole file as binary and when I reach a chunk of text, read it "character by character" until I realize that the text is ending and then go back to reading it as binary. However this means that I have to read byte-by-byte and do my own decoding of UTF-8 characters (do I need to read another byte for this character before doing something with it?). If it was a fixed-width character encoding I would just read that many bytes each time. In the end I would also like the universal line endings as supported by the Python text-readers, but that would be even more difficult to implement while reading byte-by-byte.
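
For reference, the "do I need to read another byte?" question can be answered from the first byte alone, because UTF-8 encodes the length of each sequence in its lead byte. A minimal sketch, assuming Python 3 and valid UTF-8; `utf8_sequence_length` and `read_char` are hypothetical helper names, not part of the standard library:

```python
import io


def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting with lead_byte occupies."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if 0xC2 <= lead_byte <= 0xDF:   # 110xxxxx: two bytes
        return 2
    if 0xE0 <= lead_byte <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead_byte <= 0xF4:   # 11110xxx: four bytes
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)


def read_char(f):
    """Read exactly one UTF-8 character from a binary file object ('' at EOF)."""
    first = f.read(1)
    if not first:
        return ""
    remaining = utf8_sequence_length(first[0]) - 1
    return (first + f.read(remaining)).decode("utf-8")


# Made-up sample: three characters (1, 2 and 3 bytes long), then binary data.
demo = io.BytesIO("a\u00e9\u20ac".encode("utf-8") + b"\x00\x01")
chars = [read_char(demo) for _ in range(3)]
offset = demo.tell()  # byte offset where the binary data starts
```

Because the file stays in binary mode, `tell()` remains a plain byte offset throughout.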

Another, easier solution would be if I could ask the text file object for its real byte offset in the file. That alone would solve all my problems.

Without knowing how the data is delimited there is no way to do this, as the binary data values could potentially coincide with UTF-8 values. - Chad S.
Think about it as an XML file (although it's not) where <some_tag_names> indicates that there is a switch to binary data. That tag could contain attributes though, and the tags aren't necessarily predefined (the file itself could say some_tag_names is a binary element). The transition to binary is unambiguous but difficult to handle, since it's not like "read until a particular character" or "read n bytes". - coderforlife
You can read n bytes. - Chad S.
What I meant is that I cannot read n bytes and know that is the extent of the text data (which is commonly done for text strings embedded in binary files: they are prefixed with the length in bytes/characters). - coderforlife
Why would someone have written an io stream for your arbitrary file format? - Chad S.

1 Answer

One way might be to use Hachoir to define a file parsing protocol.

The simple alternative is to open the file in binary mode and manually initialise a buffer and text wrapper around it. You can then switch in and out of binary pretty neatly:

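import io

my_file = io.open("myfile.txt", "rb")
# Not as performant, but a larger buffer will "eat" into the binary data.
# Caveat: this relies on the text wrapper not reading ahead of the character
# it returns; check that on your Python version before depending on it.
my_file_buffer = io.BufferedReader(my_file, buffer_size=1)
my_file_text_reader = io.TextIOWrapper(my_file_buffer, encoding="utf-8")
string_buffer = ""
at_eof = False

while True:
    while "near the end" not in string_buffer:
        char = my_file_text_reader.read(1)  # read one Unicode char at a time
        if not char:  # end of file: stop instead of looping forever
            at_eof = True
            break
        string_buffer += char
    if at_eof:
        break

    # binary data must be next. Where do we get the binary length from?
    print(string_buffer)
    data = my_file_buffer.read(3)

    print(data)
    string_buffer = ""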
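
If the text wrapper turns out to read ahead on a given Python version, one alternative is to skip `TextIOWrapper` and feed bytes to an incremental decoder yourself, so the binary file object's position always sits exactly at the end of the text consumed so far. This is only a sketch, assuming Python 3; `read_text_until`, the `<bin>` marker and the sample bytes are made up for illustration, and the fixed length of 3 stands in for however the real format says how much binary data follows:

```python
import codecs
import io


def read_text_until(f, marker, encoding="utf-8"):
    """Decode text from binary file object f until the text ends with marker.

    Bytes are pulled one at a time, so f is left positioned on the first
    byte after the marker. Returns the decoded text, marker included.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    text = ""
    while not text.endswith(marker):
        byte = f.read(1)
        if not byte:
            raise EOFError("marker %r not found before end of file" % marker)
        # decode() returns "" until a complete character has been fed in.
        text += decoder.decode(byte)
    return text


# In-memory stand-in for the real file: text, 3 binary bytes, text, 3 binary bytes.
f = io.BytesIO(
    "h\u00e9llo <bin>".encode("utf-8") + b"\x00\xff\x10"
    + "more text<bin>".encode("utf-8") + b"\x01\x02\x03"
)
first_text = read_text_until(f, "<bin>")
first_blob = f.read(3)
second_text = read_text_until(f, "<bin>")
second_blob = f.read(3)
```

Since every read goes through the same binary file object, the file can be opened with ordinary buffering (`open(path, "rb")`) and the one-byte reads are served from that buffer.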

A quicker, less extensible way might be to use the approach you suggested in your question: parse the text portions yourself, reading one UTF-8 byte sequence at a time. The following code (from http://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python) seems to be a neat way to conservatively read UTF-8 bytes into characters from a binary file:

def get_next_character(f):
    # note: assumes valid UTF-8 and a file opened in binary mode
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # we've encountered a multibyte character
                # read another byte and try again
                b = f.read(1)
                if not b:
                    # EOF in the middle of a character: stop rather than loop forever
                    return
                c += b
            else:
                # c was a valid char, and was yielded, continue
                c = f.read(1)
                break

# Usage:
with open("input.txt","rb") as f:
    my_unicode_str = ""
    for c in get_next_character(f):
        my_unicode_str += c
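
The question also asks for universal line endings. One possible way to get them while still reading byte by byte is to wrap an incremental UTF-8 decoder in `io.IncrementalNewlineDecoder`, the helper the `io` module itself uses for universal newlines mode. A rough sketch, assuming Python 3; `iter_text` is a hypothetical name and the sample bytes are made up:

```python
import codecs
import io


def iter_text(f, encoding="utf-8"):
    """Yield decoded text from binary file object f, mapping CRLF and CR to LF."""
    inner = codecs.getincrementaldecoder(encoding)()
    decoder = io.IncrementalNewlineDecoder(inner, True)  # True: translate newlines
    while True:
        byte = f.read(1)
        if not byte:
            # End of input: flush a held-back CR (or fail on a truncated character).
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            return
        chunk = decoder.decode(byte)
        if chunk:
            yield chunk


sample = io.BytesIO(b"a\r\nb\rc\n\xc3\xa9\r")
translated = "".join(iter_text(sample))
```

Note that the newline decoder holds back a trailing CR until it sees the next byte, so at that moment the file position can be one byte ahead of the text yielded so far; that matters if the switch to binary data can happen right after a line ending.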