1
votes

I'm parsing a date column that contains irregular date formats that wouldn't be interpreted by pandas'. Dates include different languages for days, months, and years as well as varying formats. The date entries often include timestamps as well. (Bonus: Would separating them by string/regex with lambda/loops be the fastest method?) What's the best option and workflow to tackle these several tens of thousands of date entries?

The entries unknown to pandas and dateutil.parser.

Examples include:

19.8.2017, 21:23:32 
31/05/2015 19:41:56   
Saturday, 18. May  
11 - 15 July 2001 
2019/4/28 下午6:29:28 
1 JuneMay 2000    
19 aprile 2008 21:16:37 GMT+02:00 
Samstag, 15. Mai 2010 20:55:10   
So 23 Jun 2007 23:45 CEST                                       
28 August 1998                                                 
30 June 2001    
1 Ноябрь 2008 г. 18:46:59  
Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) 
May-28-11 6:56:08 PM
Sat Jun 26 2010 21:55:54 GMT+0200 (West-Europa (zomertijd))
lunedì 5 maggio 2008 9.30.33 

"ValueError: ('Unknown string format:', '1 JuneMay 2000')"

I realize this may be a cumbersome and undesirable task. Luckily the dates are currently nonessential to my project so they may be left alone, but a solution would be favorable. Any and all replies are appreciated, thank you.

1

1 Answers

1
votes

Line by line, lots of your dates works:

>>> pd.to_datetime('19.8.2017, 21:23:32')
Timestamp('2017-08-19 21:23:32')

But there are many matters:

  • as your format is irregular, pandas cannot guess if 01-02-2019 is the first of february 2019 or the second of january 2019, I don't know if you can,
  • some of your example cannot be converted into date Saturday, 18. May: which year?
  • there are month and date in different languages (aprile seems Italian, Samstag is German)
  • some of your example works without the parenthesis content:
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200')  # works
Timestamp('2011-06-18 19:46:46-0200', tz='pytz.FixedOffset(-120)')

>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ')  # doesn't work.
...
ValueError: ('Unknown string format:', 'Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ')

It is sure that you cannot have all the date into timestamp, I would try to create a new column with the correctly parsed date in timestamp and the other saved as NaT.

For example:

date
02-01-2019
Saturday, 18. May

will become:

date               new date
02-01-2019         Timestamp('2019-01-02 00:00:00.00)
Saturday, 18. May  NaT

For this I would delete the parenthesis in the initial column:

df2 = df.assign(
    date2=lambda x: x['date'].str.split('(')[0],
    new_date=lambda x: x['date2'].apply(lambda y: pd.to_datetime(y, errors='coerce'), axis='columns')) # apply the function row by row
# This will work with python >= 3.6

After, you can see what's remaining with keeping NaT values.

For translation, you can try to replace words but it will be really long.

This is really slow (due to the apply row by row) but if your data are not consistent you cannot work directly on a column.

I hope it will help.