2
votes

Problem:

I am opening a .xls with pd.read_excel, but I got an error. ("Pack excel file" downloaded from https://cima.aemps.es/cima/publico/nomenclator.html)

df_cima = pd.read_excel("price_tracker/es/support/Presentaciones.xls")

xlrd.biffh.XLRDError: Excel xlsx file; not supported

The suffix of this file is .xls but this error tells me that it is .xlsx

Then I tried to add engine="openpyxl", which is usually used for reading the .xlsx when xlrd version is no longer 1.2.0, then it gives me another error

openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

MY env:

  • pandas version: 1.1.5
  • xlrd version: 2.0.1
  • openpyxl version: 3.0.6

I do not want to change my xlrd version back to 1.2.0, from other answer I see that new version of xlrd support only .xls, but I don't understand why it is not working for my file.

Thanks in advance.

1
Can you try saving the file as xlsx first ? - SeaBean
then change the extension of file from .xls to .xlsx by opening in ms excel or any other excel editor/viewer and saving it in same place/path with same name but different extension i.e ..xlsx - Anurag Dabas
@AnuragDabas I have multiple files per month like this, I would not choose to change it manuelly, could we change .xls to .xlsx directly when we download it with a scrpit ? - Mapotofu
My Python is 3.8. Your situation then must be different combinations of packages causing it. However, it would be very time consuming to sort it out. Just bypass it if you can e.g. by what you mentioned downloading the xlsx file directly or do it like me to convert it manually. - SeaBean
@SeaBean, I tried pandas-1.2.1 and xlrd-2.0.1 first then pandas-1.1.5 and xlrd-2.0.1, and pycharm console is not updated when i switched env. So I just tried again these two combinations (pandas-1.2.1 and xlrd-2.0.1 vs pandas-1.1.5 and xlrd-2.0.1), this time indeed pandas-1.2.1 and xlrd-2.0.1 works well but not pandas-1.1.5 and xlrd-2.0.1, I think there must be an unresolved issue with the combination of pandas-1.1.5 and xlrd-2.0.1, I will try again tomorrow at my office - Mapotofu

1 Answers

1
votes

Your file is not a .xls, I insist! :)

With Pandas 1.1.5 and xlrd 2.1.0

Rename Presentaciones.xls to Presentaciones.xlsx.

import pandas as pd
# Use openpyxl.
df = pd.read_excel(r'X:...\Presentaciones.xlsx', engine='openpyxl')
print(df)

Enjoy! :)

More info

How do I know that your file is a fake .xls and a very real .xlsx? Because openpyxl doesn't work with xls files.

import pandas as pd
df = pd.read_excel(r'X:...\test.xls', engine='openpyxl')
/* 
   ERROR:
   InvalidFileException: openpyxl does not support the old .xls file format, 
   please use xlrd to read this file, or convert it to the more recent .xlsx file format.
*/

And trying to simply rename test.xls to test.xlsx does not work either!

import pandas as pd
df = pd.read_excel(r'X:...\test.xlsx', engine='openpyxl')
/*
    Error:
    OSError: File contains no valid workbook part
*/

History

Beware, the .xlsx extension (detected by pandas) means there may be scripts in this file. Sometimes the extension can lie, so be careful!

The reason why panda stopped supporting xlsx files is that those files are a security hazard and no one was maintaining this part of the code.