How to read large file with unicode in Python 3

Question

Hello, I have a large file that contains Unicode characters, and when I try to open it in Python 3 I get this error:

File "addRNC.py", line 47, in add_rnc()

File "addRNC.py", line 13, in init for value in rawDoc.readline():

File "/usr/local/lib/python3.1/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 158: invalid continuation byte

I have tried everything and nothing worked. Here is the code:

    rawDoc = io.open("/root/potential/rnc_lst.txt", 'r', encoding='utf8')
    result = []
    for value in rawDoc.readline():

        if len(value.split('|')[9]) > 0 and len(value.split('|')[10]) > 0: 
            if value.split('|')[9] == 'ACTIVO' and value.split('|')[10] == 'NORMAL':
                address = ''
                for piece in value.split('|')[4:7]:
                    address += piece
                if value.split('|')[8] != '':
                    rawdate = value.split('|')[8].split('/')
                    _date = rawdate[2]+"-"+rawdate[1]+"-"+rawdate[0]
                else:
                    _date = 'NULL'

                id = db.prepare("SELECT id FROM potentials_reg WHERE(rnc = '%s')"%(value.split('|')[0]))()

                if len(id) == 0:
                    if _date == 'NULL':
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', NULL, '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7], 'true'))()
                    else:
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7],_date, 'true'))()
                else:
                    pass

    db.close()

What makes you think that the file is a Unicode file that’s encoded in UTF-8? Byte 0xD3 is a U+201D ʀɪɢʜᴛ ᴅᴏᴜʙʟᴇ Qᴜᴏᴛᴀᴛɪᴏɴ ᴍᴀʀᴋ in the MacRoman encoding, for example. Does the file validate as UTF-8? — tchrist

Borealid · Accepted Answer · 2012-02-01T04:51:03

Your file actually contains invalid UTF-8.

When you say "contains unicode characters", you should be aware that Unicode doesn't specify how the characters are represented. So even if the file represents Unicode data, it could be in UTF-8, UTF-16 (UTF-16BE or UTF-16LE, each with or without a BOM), the deprecated UCS-2, or perhaps even one of the more esoteric forms...
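As a small illustration of that point, the same character produces different bytes under different encodings. The character Ó (U+00D3) is used here only as an example; nothing in the question confirms that it is what the file contains.

```python
# One Unicode character, several byte representations.
ch = "\u00d3"  # LATIN CAPITAL LETTER O WITH ACUTE (example character only)

as_utf8 = ch.encode("utf-8")          # lead byte plus one continuation byte
as_utf16le = ch.encode("utf-16-le")   # one 16-bit unit, little-endian, no BOM
as_utf16be = ch.encode("utf-16-be")   # one 16-bit unit, big-endian, no BOM
as_latin1 = ch.encode("latin-1")      # a single byte

for name, data in [("utf-8", as_utf8), ("utf-16-le", as_utf16le),
                   ("utf-16-be", as_utf16be), ("latin-1", as_latin1)]:
    print(name, data.hex())
```

Only the Latin-1 form is the lone byte 0xD3; the UTF-8 form of the same character is the pair 0xC3 0x93.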

Double-check that the file is valid. I'd bet that you do have a byte 0xD3 (11010011) at position 158. In UTF-8, a byte of that form can only be the leading byte of a two-byte character, so the byte right after it must be a continuation byte (binary 10xxxxxx, that is, 0x80 through 0xBF). "Invalid continuation byte" means that the byte following 0xD3 is not one.
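A minimal sketch that reproduces the message from the traceback. The bytes are made up for the demonstration and are not taken from the actual file: a 0xD3 byte that is not part of a well-formed two-byte sequence.

```python
# Made-up sample: 0xD3 followed by a plain ASCII letter.
data = b"\xd3N"

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the reported byte; exc.reason is the message text.
    offending = data[exc.start]
    reason = exc.reason
    print(hex(offending), exc.start, reason)
```

In CPython the reported byte is the 0xD3 itself, with the reason "invalid continuation byte".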

The most likely reason for this is that your file contains non-ASCII characters, but isn't in UTF-8.
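If the file turns out to be in a legacy single-byte encoding, passing that encoding when opening it should avoid the error. The sketch below assumes Latin-1 (ISO-8859-1), where 0xD3 is Ó. That is a guess based on the Spanish field values in the question, not something the question confirms, and `read_records` is a hypothetical helper name. It also iterates over the file object itself, which yields one line at a time; iterating over `readline()`, as the question's code does, would walk through the characters of the first line only.

```python
import io

def read_records(path, encoding="latin-1"):
    """Yield each line of a pipe-delimited file as a list of fields.

    The default encoding is an assumption; pass whatever the file really uses.
    """
    with io.open(path, "r", encoding=encoding) as raw_doc:
        for line in raw_doc:  # streams line by line, so large files are fine
            yield line.rstrip("\n").split("|")
```

Looping over `read_records("/root/potential/rnc_lst.txt")` would then give the result of `split('|')` once per line, so each field only has to be split out once.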
