import os
import shutil
import codecs


directory = os.path.expanduser('~/Desktop/ra/clean_tokenized/1987')

for filename in os.listdir(directory):
    full_name = os.path.join(directory, filename)
    with open(full_name, 'r') as article:
        for line in article:
            print(line)

Here's the traceback:

Traceback (most recent call last):
  File "~/Desktop/corpus_filter/01_corpus.py", line 11, in <module>
    for line in article:
  File "~/.conda/envs/MangerRA/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

The files contain Japanese characters, and I'm just trying to make a CSV file with all the words that come up in them. But I can't get past this error.

1 Answer


Python is decoding your file as UTF-8, which is the default most of the time these days (when you don't pass encoding= to open(), it uses the locale's preferred encoding). Your file is in some other encoding, or is otherwise corrupted, so the decoding fails: byte 0x80 can never start a character in UTF-8.

I can't tell from the traceback which encoding your file uses, so you'll have to investigate that yourself. You might try another encoding that is common for Japanese text, like Shift JIS (using open(full_name, 'r', encoding='shift-jis')), and see whether you get valid text or mojibake.
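
If you want to try several encodings at once, something like the following might help. It is only a sketch: try_encodings is a made-up helper name, and the candidate list is my guess at encodings commonly seen for Japanese text, not something I know about your files. A clean decode does not prove the encoding is right, so you still need to look at the text.

```python
def try_encodings(raw, candidates=("utf-8", "shift_jis", "cp932", "euc_jp", "iso2022_jp")):
    """Return the candidate encodings that decode `raw` without raising an error."""
    decodable = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be the file's encoding
        decodable.append(name)
    return decodable


# Demo on bytes that are known to be Shift JIS (stand-in for a real file's contents).
sample = "日本語のテキスト".encode("shift_jis")
print(try_encodings(sample))
```

For a real file you would read the bytes with open(full_name, 'rb') and pass the result of .read() to the helper.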

If all else fails, you can open the file in binary mode ('rb' rather than just 'r'), and check out what is located at byte 3131 and immediately afterwards. It may be just a messed up bit of data in the file that you can delete or fix manually.
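
Here is a rough way to do that inspection. Again just a sketch: find_bad_bytes is a hypothetical helper, not a standard function. It decodes the whole byte string in one go, so the offset it reports is a position in the file; the position in a traceback from text-mode reading is relative to the chunk being decoded and may not match the file offset for larger files.

```python
def find_bad_bytes(raw, encoding="utf-8", context=16):
    """Return (offset, surrounding bytes) for the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not decode
        start = max(exc.start - context, 0)
        return exc.start, raw[start:exc.start + context]
    return None  # everything decoded cleanly


# Demo on a small byte string with a stray 0x80 in the middle.
data = b"abc\x80def"
print(find_bad_bytes(data))
```

With your files, read the bytes using open(full_name, 'rb') and pass them in; the returned slice shows what sits around the offending byte.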