33
votes

how do I import excel data into a dataframe in python.

Basically the current excel workbook runs some vba on opening which refreshes a pivot table and does some other stuff.

Then I wish to import the results of the pivot table refresh into a dataframe in python for further analysis.

import xlrd

wb = xlrd.open_workbook('C:\Users\cb\Machine_Learning\cMap_Joins.xlsm')

#sheetnames
print wb.sheet_names()

#number of sheets
print wb.nsheets

The refreshing and opening of the file works fine. But how do i select the data from the first sheet from say row 5 including header down to last record n.

2
Thanks Andy. I couldn't get the pd.ExcelFile to work with a .xlsm file and couldn't seem to install openpyxl which is supposed to work with .xlsx and .xlsm files. I just changed my vba to save as .xls. everything worked then except the skip_rows=4 part but i just removed the unwanted rows later in python with df2 = df[2:] - IcemanBerlin
Oh. Actually just adjusted skip_rows=4 with skiprows = 4 and that part worked too. great thanks. You get a silver star...gold if the .xlsm file had worked ;) - IcemanBerlin
Whoops my bad! Will add an issue about xlsm files on github. - Andy Hayden

2 Answers

48
votes

You can use pandas' ExcelFile parse method to read Excel sheets, see io docs:

xls = pd.ExcelFile('C:\Users\cb\Machine_Learning\cMap_Joins.xlsm')

df = xls.parse('Sheet1', skiprows=4, index_col=None, na_values=['NA'])

skiprows will ignore the first 4 rows (i.e. start at row index 4), and several other options.

21
votes

The accepted answer is old (as discussed in comments of the accepted answer). Now the preferred option is using pd.read_excel(). For example:

df = pandas.read_excel('C:\Users\cb\Machine_Learning\cMap_Joins.xlsm'), skiprows=[0,1,2,3,4])