Python3 TypeError: a bytes-like object is required, not 'str'

Question

I am following the OpenCV exercise at http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html but got stuck at the step that runs mergevec.py (I use the Python version instead of the .cpp one). I have Python 3 rather than the Python 2.x used in the article.

The source for this file is https://github.com/wulfebw/mergevec/blob/master/mergevec.py

The error I got was

Traceback (most recent call last):
  File "./tools/mergevec1.py", line 96, in <module>
    merge_vec_files(vec_directory, output_filename)
  File "./tools/mergevec1.py", line 45, in merge_vec_files
    val = struct.unpack('<iihh', content[:12])
TypeError: a bytes-like object is required, not 'str'

I tried to follow the question "python 3.5: TypeError: a bytes-like object is required, not 'str' when writing to a file" and used open(f, 'r', encoding='utf-8', errors='ignore'), but still no luck.

My modified code is below:

import sys
import glob
import struct
import argparse
import traceback


def exception_response(e):
    exc_type, exc_value, exc_traceback = sys.exc_info()
    lines = traceback.format_exception(exc_type, exc_value, exc_traceback)
    for line in lines:
        print(line)

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-v', dest='vec_directory')
    parser.add_argument('-o', dest='output_filename')
    args = parser.parse_args()
    return (args.vec_directory, args.output_filename)

def merge_vec_files(vec_directory, output_vec_file):


    # Check that the .vec directory does not end in '/' and if it does, remove it.
    if vec_directory.endswith('/'):
        vec_directory = vec_directory[:-1]
    # Get .vec files
    files = glob.glob('{0}/*.vec'.format(vec_directory))

    # Check to make sure there are .vec files in the directory
    if len(files) <= 0:
        print('Vec files to be mereged could not be found from directory: {0}'.format(vec_directory))
        sys.exit(1)
    # Check to make sure there are more than one .vec files
    if len(files) == 1:
        print('Only 1 vec file was found in directory: {0}. Cannot merge a single file.'.format(vec_directory))
        sys.exit(1)


    # Get the value for the first image size
    prev_image_size = 0
    try:
        with open(files[0], 'r', encoding='utf-8', errors='ignore') as vecfile:
            content = ''.join(str(line) for line in vecfile.readlines())
            val = struct.unpack('<iihh', content[:12])
            prev_image_size = val[1]
    except IOError as e:
        f = None
        print('An IO error occured while processing the file: {0}'.format(f))
        exception_response(e)


    # Get the total number of images
    total_num_images = 0
    for f in files:
        try:
            with open(f, 'r', encoding='utf-8', errors='ignore') as vecfile:
                content = ''.join(str(line) for line in vecfile.readlines())
                val = struct.unpack('<iihh', content[:12])
                num_images = val[0]
                image_size = val[1]
                if image_size != prev_image_size:
                    err_msg = """The image sizes in the .vec files differ. These values must be the same. \n The image size of file {0}: {1}\n 
                        The image size of previous files: {0}""".format(f, image_size, prev_image_size)
                    sys.exit(err_msg)

                total_num_images += num_images
        except IOError as e:
            print('An IO error occured while processing the file: {0}'.format(f))
            exception_response(e)


    # Iterate through the .vec files, writing their data (not the header) to the output file
    # '<iihh' means 'little endian, int, int, short, short'
    header = struct.pack('<iihh', total_num_images, image_size, 0, 0)
    try:
        with open(output_vec_file, 'wb') as outputfile:
            outputfile.write(header)

            for f in files:
                with open(f, 'w', encoding='utf-8', errors='ignore') as vecfile:
                    content = ''.join(str(line) for line in vecfile.readlines())
                    data = content[12:]
                    outputfile.write(data)
    except Exception as e:
        exception_response(e)


if __name__ == '__main__':
    vec_directory, output_filename = get_args()
    if not vec_directory:
        sys.exit('mergvec requires a directory of vec files. Call mergevec.py with -v /your_vec_directory')
    if not output_filename:
        sys.exit('mergevec requires an output filename. Call mergevec.py with -o your_output_filename')

    merge_vec_files(vec_directory, output_filename)

Do you know what I did wrong? Thanks.

UPDATE 1

I did this:

content = b''.join(str(line) for line in vecfile.readlines())

I added a "b" prefix to the empty string. However, now I get a different error:

Traceback (most recent call last):
  File "./tools/mergevec1.py", line 97, in <module>
    merge_vec_files(vec_directory, output_filename)
  File "./tools/mergevec1.py", line 44, in merge_vec_files
    content = b''.join(str(line) for line in vecfile.readlines())
TypeError: sequence item 0: expected a bytes-like object, str found

Did you try opening the file with mode 'rb' as it recommends in the answer for the question you cited? You may also need to drop the encoding part as you're just reading the bytes and the encoding is irrelevant. — Craig
Yes, rb was in the original code, which also caused an error in the first place. I also tried open(files[0], 'r', errors='ignore') based on your suggestion just now and still get the same error. — HP.
The rb is correct. You are then converting the bytes to a string again in the next line. What is the structure of your file? Why are you joining the lines? What about using content = vecfile.read() instead of the ''.join(...)? — Craig
They are basically binary format (.vec files). Reference: docs.python.org/3.3/tutorial/inputoutput.html — HP.
If they are binary format, then vecfile.read() should solve your problem. Using vecfile.readlines() could mangle the file contents. — Craig

Craig · Accepted Answer · 2017-03-26T00:56:22

As the OP explains, the file contains binary data. In order to work with binary data:

The file should be opened in binary mode, by using 'rb' as the mode in the open call.
After opening the file, use .read() rather than .readlines() to read the data. This avoids possible corruption of the data caused by the way .readlines() handles line ending characters.
Avoid str(line) and ''.join(...): they turn the bytes back into a str (text string), and struct.unpack requires a bytes-like object.
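As a minimal sketch of the three points above: the helper name read_vec_header and the constants HEADER_FORMAT and HEADER_SIZE are hypothetical and not part of mergevec.py. The demo writes a tiny fake .vec file so that it runs without real data, and it assumes the 12-byte '<iihh' header layout used in the question.

```python
import os
import struct
import tempfile

HEADER_FORMAT = '<iihh'  # little endian: int, int, short, short
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 12 bytes for this format


def read_vec_header(path):
    """Return (num_images, image_size) from the header of a .vec file."""
    with open(path, 'rb') as vecfile:        # binary mode: read() returns bytes
        header = vecfile.read(HEADER_SIZE)   # only the header is needed here
    num_images, image_size, _, _ = struct.unpack(HEADER_FORMAT, header)
    return num_images, image_size


# Build a fake .vec file (header plus a few payload bytes) for the demo.
with tempfile.NamedTemporaryFile(suffix='.vec', delete=False) as tmp:
    tmp.write(struct.pack(HEADER_FORMAT, 3, 576, 0, 0) + b'\x00' * 10)
    demo_path = tmp.name

demo_header = read_vec_header(demo_path)
os.remove(demo_path)
print(demo_header)
```

Because the file is opened with 'rb', the value passed to struct.unpack is already bytes and no conversion step is needed.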

For the code provided in the question, the section of the code to read the images should be:

for f in files:
    try:
        with open(f, 'rb') as vecfile:
            content = vecfile.read()
            val = struct.unpack('<iihh', content[:12])
            num_images = val[0]
            image_size = val[1]
            if image_size != prev_image_size:
                err_msg = """The image sizes in the .vec files differ. These values must be the same. \n The image size of file {0}: {1}\n 
                    The image size of previous files: {2}""".format(f, image_size, prev_image_size)
                sys.exit(err_msg)

            total_num_images += num_images
    except IOError as e:
        print('An IO error occurred while processing the file: {0}'.format(f))
        exception_response(e)
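The final loop in the question needs the same treatment. It opens each input with mode 'w', which would truncate the .vec files instead of reading them. Below is a sketch of that step, assuming the same 12-byte header; the function name write_merged_vec and the demo files are hypothetical and only there to make the example runnable.

```python
import os
import shutil
import struct
import tempfile

HEADER_FORMAT = '<iihh'  # little endian: int, int, short, short
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def write_merged_vec(files, output_vec_file, total_num_images, image_size):
    """Write one combined header, then each input file's data without its header."""
    header = struct.pack(HEADER_FORMAT, total_num_images, image_size, 0, 0)
    with open(output_vec_file, 'wb') as outputfile:
        outputfile.write(header)
        for f in files:
            with open(f, 'rb') as vecfile:   # 'rb', not 'w': read the bytes as-is
                content = vecfile.read()
            outputfile.write(content[HEADER_SIZE:])  # slicing bytes yields bytes


# Demo with two fake .vec files (header + payload), purely illustrative.
demo_dir = tempfile.mkdtemp()
demo_inputs = []
for name, payload in (('a.vec', b'AAAA'), ('b.vec', b'BB')):
    path = os.path.join(demo_dir, name)
    with open(path, 'wb') as fh:
        fh.write(struct.pack(HEADER_FORMAT, 1, 4, 0, 0) + payload)
    demo_inputs.append(path)

merged_path = os.path.join(demo_dir, 'merged.vec')
write_merged_vec(demo_inputs, merged_path, total_num_images=2, image_size=4)
with open(merged_path, 'rb') as fh:
    merged = fh.read()
shutil.rmtree(demo_dir)
```

Since the output file is opened with 'wb', every value written to it must be bytes, which holds as long as the inputs are read in binary mode.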
