8
votes

Python 2.7.10
Tried pandas 0.17.1 -- function read_excel
Tried pyexcel 0.1.7 + pyexcel-xlsx 0.0.7 -- function get_records()

When using pandas in Python is it possible to read excel files (formats: xls|xlsx) and leave columns containing date or date + time values as strings rather than auto-converting to datetime.datetime or timestamp types?

If this is not possible using pandas can someone suggest an alternate method/library to read xls|xlsx files and leave date column values as strings?

For the pandas solution attempts the df.info() and resultant date column types are shown below:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 117 entries, 0 to 116
Columns: 176 entries, Mine to Index
dtypes: datetime64[ns](2), float64(145), int64(26), object(3)
memory usage: 161.8+ KB
>>> type(df['Start Date'][0])
Out[6]: pandas.tslib.Timestamp
>>> type(df['End Date'][0])
Out[7]: pandas.tslib.Timestamp

Attempt/Approach 1:

def read_as_dataframe(filename, ext):
   import pandas as pd
   if ext in ('xls', 'xlsx'):
      # problem: date columns auto converted to datetime.datetime or timestamp!
      df = pd.read_excel(filename) # unwanted - date columns converted!

   return df, name, ext

Attempt/Approach 2:

import pandas as pd
# import datetime as datetime
# parse_date = lambda x: datetime.strptime(x, '%Y%m%d %H')
parse_date = lambda x: x
elif ext in ('xls', 'xlsx', ):
    df = pd.read_excel(filename, parse_dates=False)
    date_cols = [df.columns.get_loc(c) for c in df.columns if c in ('Start Date', 'End Date')]
    # problem: date columns auto converted to datetime.datetime or timestamp!
    df = pd.read_excel(filename, parse_dates=date_cols, date_parser=parse_date)

And have also tried pyexcel library but it does the same auto-magic convert behavior:

Attempt/Approach 3:

import pyexcel as pe
import pyexcel.ext.xls
import pyexcel.ext.xlsx

t0 = time.time()
if ext == 'xlsx':
    records = pe.get_records(file_name=filename)
    for record in records:
        print("start date = %s (type=%s), end date = %s (type=%s)" %
              (record['Start Date'],
               str(type(record['Start Date'])),
               record['End Date'],
               str(type(record['End Date'])))
              )
3
I don't understand your question, if you don't pass date_cols or a date_parser then it won't try to parse the date strings - EdChum
As you can see from my code snippets I am not passing date_cols or date_parser for first style approach above and pandas is auto-magically (and silently) converting the columns to datetime values. I need to find a way to stop that and leave any column containing a date value as a string. - MattB
Are you sure, please post df.info() after running your code - EdChum
Thanks @EdChum will explore csv approach further. The spreadsheets themselves are out of my control (format + content) so I need to work with formats provided to me and I am trying to develop a robust custom method for trying different date formats (time zones) and presenting user with best fit. - MattB
I think @EdChum is correct. The type you get depends on how the data is typed in the excel file itself. If the column is formatted as datetime data there, you will get datetime data from read_excel. - joris

3 Answers

3
votes
  • Using converters{'Date': str} option inside the pandas.read_excel which helps. pandas.read_excel(xlsx, sheet, converters={'Date': str})
  • you can try convert your timestamp back to the original format
    df['Date'][0].strftime('%Y/%m/%d')
3
votes

I ran into an identical problem, except pandas was oddly converting only some cells into datetimes. I ended up manually converting each cell into a string like so:

def undate(x):
    if pd.isnull(x):
        return x
    try:
        return x.strftime('%d/%m/%Y')
    except AttributeError:
        return x
    except Exception:
        raise

for i in list_of_possible_date_columns:
    df[i] = df[i].apply(undate)
3
votes

I tried saving the file in a CSV UTF-8 format (manually) and used pd.read_csv() and worked fine.

I tried a bunch of things to figure the same thing with read_excel. Did not work anything for me. So, I am guessing read_excel is probably updating your string in a datetime object which you can not control.