2
votes

I'm working with some code that reads data from xlsx files by parsing the xml. It is all pretty straightforward, with the exception of date cell.

Dates are stored as integers and have an "s" attribute that is an index into the stylesheet, which can be used to get a date formatting string. Here are some examples from a previous stackoverflow question that is linked below:

19 = 'h:mm:ss AM/PM';

20 = 'h:mm';

21 = 'h:mm:ss';

22 = 'm/d/yy h:mm';

These are the built in date formatting strings from the ooxml standard, however it seems like excel tends to use custom formatted strings instead of the builtins. Here is an example format from an Excel 2007 spreadsheet. numFmtId greater than 164 is a custom format.

<numFmt formatCode="MM/DD/YY" numFmtId="165"/>

Determining if a cell should be formatted as a date is difficult because the only indicator I can find is the formatCode. This one is obviously a date, but cells could be formatted any number of ways. My initial attempt is to look for Ms, Ds, and Ys in the formatCode, but that seems problematic.

Has anybody had any luck with this problem? It seems like the standard excel reading libraries are lacking in xlsx support at this time. I've read through the standards and have dug through a lot of xlsx files without much luck.

The best information seems to come from this stackoverflow question:

what indicates an office open xml cell contains a date time value

Thanks!

2

2 Answers

5
votes

Dates are stored as integers

In the Excel data model, there is really no such thing as an integer. Everything is a float. Dates and datetimes are floats, representing days and a fraction since a variable epoch. Times are fractions of a day.

It seems like the standard excel reading libraries are lacking in xlsx support at this time.

google("xlsxrd"). To keep up to date, join the python-excel group.

Edit I see that you have already asked a question there. If you had asked a question there as specific as this one, or responded to my request for clarification, you would have this info over two weeks ago.

Have a look at the xlrd documentation. Up the front there is a discussion on Excel dates. All of it applies to Excel 2007 as well as earlier versions. In particular: it is necessary to parse custom formats. It is necessary to have a table of "standard" format indexes which are for date formats. "Standard" formats listed in some places don't include the formats used in CJK locales.

Options for you:

(1) Borrow from the xlrd source code, including the xldate_as_tuple function.

(2) Option (1) + Get the xlsxrd bolt-on kit and borrow from its source code.

(3) [Recommended] Get the xlsxrd bolt-on kit and use it ... you get a set of APIs that operate across Excel versions 2.0 to 2007 and Python versions 2.1 to 2.7.

0
votes

It isn't enough simply to look for Ms, Ds, and Ys in the number format code

[Red]#,##0 ;[Yellow](#,##0)

is a perfectly valid number format, which contains both Y and D, but isn't a date format. I specifically test for any of the standard date/time formatting characters ('y', 'm', 'd', 'H', 'i', 's') that are outside of square braces ('[' ']'). Even then, I was finding that a few false positives were slipping through, mainly associated with accounting and currency formats. Because these typically begin with either an underscore ('_') or a space followed by a zero (' 0') (neither of which I've ever encountered in a date format, I explicitly filter these values out.

A part of my (PHP) code for determining if a format mask is a date or not:

private static  $possibleDateFormatCharacters = 'ymdHis';

//  Typically number, currency or accounting (or occasionally fraction) formats
if ((substr($pFormatCode,0,1) == '_') || (substr($pFormatCode,0,2) == '0 ')) {
    return false;
}
// Try checking for any of the date formatting characters that don't appear within square braces
if (preg_match('/(^|\])[^\[]*['.self::$possibleDateFormatCharacters.']/i',$pFormatCode)) {
    return true;
}

// No date...
return false;

I'm sure that there may still be exceptions that I'm missing, but (if so) they are probably extreme cases