
I downloaded a file. It opens fine in Excel, but in Notepad its contents look garbled.

I have spent a lot of time on this, but I can't solve the error.

Link to the file: https://drive.google.com/open?id=1TDh81zdOggOexdaTxeiGz7r7jSkVqLEG

My code:

#to support encodings
# -*- coding: utf-8 -*-

import codecs

path = "badcode.xlsx"

#read input file
with codecs.open(path, 'r', encoding = 'cp1251') as file:
  lines = file.read()

#write output file
with codecs.open(path, 'w', encoding = 'utf8') as file:
  file.write(lines)

I get this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 668: character maps to <undefined>

What I tried:

path = "badcode.xlsx"
with open(path) as f:
    print(f)

This prints:

<_io.TextIOWrapper name='badcode.xlsx' mode='r' encoding='cp1251'>

The algorithm determining the codepage got it wrong. This must be cp1252 instead (a code page commonly used by Office products). – Jongware
@usr2564301 I tried cp1252, but received the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>" – alcoder
According to my link, that code indeed is not valid for cp1252 – similar to how 98 is not valid for cp1251. If you have no idea whether this is English (or a nearby European language), or possibly Greek, Cyrillic, Hebrew, or some other encoding, you can always try latin-1. That's a unique encoding in that it maps straight to Unicode codes 0000 to 00FF. – Jongware
Okay, I now see what you think you were doing. You cannot open an XLSX file as if it's plain text. That is the root cause of all of your problems with this file. (For example: character encoding? opening it in Notepad?) You need the original software (Excel itself) or a dedicated library to read binary XLSX files, not a simple 'encoding'. – Jongware
Just for the record: my proposed latin-1 encoding indeed works on your file: with codecs.open(path, 'r', encoding='latin-1') as file: lines = file.read(); print(lines) no longer raises UnicodeDecodeError. That is because – as I said above – all bytes map directly to valid Unicode characters. But it's of no use to you because this is for text files, not binaries. – Jongware
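
The codec behaviour described in the comments above can be checked without the original file. The following is only a minimal sketch: the two sample bytes are the ones named in the error messages in this thread, and the variable names (`samples`, `failures`, `decoded`) are made up for illustration.

```python
# The two bytes named in the error messages quoted in this thread.
samples = {"cp1251": b"\x98", "cp1252": b"\x9d"}

failures = {}
for codec, raw in samples.items():
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        # exc.reason holds the text after the last colon in the traceback message
        failures[codec] = exc.reason

print(failures)

# latin-1 assigns a code point to every byte value, so decoding cannot fail.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
print([ord(ch) for ch in decoded] == list(range(256)))
```

Decoding without an error only means that every byte was mapped to some character. It does not turn the bytes of a workbook into readable spreadsheet content.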

1 Answer


This code worked for me:

    import pandas as pd

    sourceFileName = "badcode.xlsx"
    # read_excel parses the workbook itself, so no text encoding is involved
    df = pd.read_excel(sourceFileName)
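
As background on why a spreadsheet reader succeeds where `codecs.open` fails: an .xlsx file is a ZIP archive of XML parts, not a text file in some code page. The sketch below uses only the standard library and builds a tiny in-memory ZIP as a stand-in for a workbook, so it runs without the file from the question. The helper name `is_zip_container` and the stand-in archive are assumptions made for illustration, not part of the original code.

```python
import io
import zipfile


def is_zip_container(source):
    """Return True if `source` (a path or binary file object) is a ZIP archive."""
    return zipfile.is_zipfile(source)


# Stand-in for a workbook: a tiny in-memory ZIP with one XML member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")

raw = buffer.getvalue()
print(raw[:2])  # ZIP data starts with the signature bytes "PK"
print(is_zip_container(io.BytesIO(raw)))
print(is_zip_container(io.BytesIO("plain text".encode("utf-8"))))
```

Calling `is_zip_container("badcode.xlsx")` on the real file should return True if it is a genuine .xlsx workbook, which would confirm that re-encoding it as text is the wrong approach.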