Finding dates in string

Question

I'm looking for a fast way in C# to find all dates in a string (each string is a large block of text, and I have to scan about 200,000 different strings).

Since there are many ways to write a date (for example 31/12/2012 or Dec 31, 2012, and many more), I'm using this regex, which should cover almost all of the common ways to write dates:

string findDates = "(?:(\d{1,4})- /.- /.)|(?:(\s\d{1,2})\s+(jan(?:uary){0,1}\.{0,1}|feb(?:ruary){0,1}\.{0,1}|mar(?:ch){0,1}\.{0,1}|apr(?:il){0,1}\.{0,1}|may\.{0,1}|jun(?:e){0,1}\.{0,1}|jul(?:y){0,1}\.{0,1}|aug(?:ust){0,1}\.{0,1}|sep(?:tember){0,1}\.{0,1}|oct(?:ober){0,1}\.{0,1}|nov(?:ember){0,1}\.{0,1}|dec(?:ember){0,1}\.{0,1})\s+(\d{2,4}))|(?:(jan(?:uary){0,1}\.{0,1}|feb(?:ruary){0,1}\.{0,1}|mar(?:ch){0,1}\.{0,1}|apr(?:il){0,1}\.{0,1}|may\.{0,1}|jun(?:e){0,1}\.{0,1}|jul(?:y){0,1}\.{0,1}|aug(?:ust){0,1}\.{0,1}|sep(?:tember){0,1}\.{0,1}|oct(?:ober){0,1}\.{0,1}|nov(?:ember){0,1}\.{0,1}|dec(?:ember){0,1}\.{0,1})\s+([0-9]{1,2})[\s,]+(\d{2,4}))";

with the options "RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace". I also tried pre-compiling the regex to make it even faster.

The problem is that it's very slow (on some text files it takes more than 2 seconds). Is there a better, more efficient way to do this?

Thanks

A simple comment, but I would try these regexes one by one: first scan with the first regex and delete the matched words, then run the others. It may be faster, depending on the input string. — daryal
{0,1} is the same as ?. The change will not speed it up, but it makes the pattern a bit easier to read. — kirilloid
If you use RegexOptions.ExplicitCapture it will be a little bit faster and you won't have to use those (?: ) groups. — Balazs Tihanyi
@Aliostad, the text was from some random webpage, for example articles from cnn.com. — meirlo
@meirlo but how big was it? If you want to improve performance you have to have some hard and fast metrics before even attempting to improve. — Aliostad

Aliostad · Accepted Answer · 2012-03-26T12:06:40

It is difficult to recommend an algorithm without testing it; we could suggest something that turns out to be slower. So really it comes down to trying different options.

Your expression looks a bit verbose, but I cannot say that it is the cause of the issue. 2 seconds for a big file is OK but not for a smaller file, so it is all relative to the amount of work being done.

One approach I can recommend is a two-stage process.

The first stage is a screening pass that fishes for the spots most likely to match; the second examines in detail only the section of the file where a candidate was found. For example, '\d{1,2}\s*,\s*\d{4}' is likely to be part of a date, and looking for it is cheaper than looking for all the conditions concerning Jan(uary)/Feb(ruary)/Mar(ch)/....
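The two-stage idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested replacement for the question's pattern: the class name TwoStageDateScanner, the LookBehind window size, and the simplified month-name pattern in stage 2 are all hypothetical, and the sketch only covers "Dec 31, 2012"-style dates.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStageDateScanner
{
    // Stage 1 (cheap screen): the fragment from the answer, e.g. "31, 2012".
    private static readonly Regex Screen =
        new Regex(@"\d{1,2}\s*,\s*\d{4}", RegexOptions.Compiled);

    // Stage 2 (detailed check): a simplified stand-in for the full pattern,
    // covering only "Dec 31, 2012" / "December 31, 2012" style dates.
    private static readonly Regex Detail = new Regex(
        @"\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?" +
        @"|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\.?\s+\d{1,2}\s*,\s*\d{4}\b",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Assumption: 12 characters before a screen hit is enough to hold
    // the longest month name, an optional dot, and a single space.
    private const int LookBehind = 12;

    public static List<string> FindDates(string text)
    {
        var results = new List<string>();
        foreach (Match hit in Screen.Matches(text))
        {
            // Re-examine only a small window ending where the screen hit ends.
            int start = Math.Max(0, hit.Index - LookBehind);
            int length = hit.Index + hit.Length - start;
            Match date = Detail.Match(text.Substring(start, length));
            if (date.Success)
            {
                results.Add(date.Value);
            }
        }
        return results;
    }
}
```

Calling TwoStageDateScanner.FindDates(text) runs the expensive alternation only on short windows around screen hits. Whether that is actually faster than a single pass depends on the input, so it should be measured rather than assumed.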

And a small piece of advice: get the metrics right first. Do the homework of establishing your baseline metrics before making any change.
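A baseline could be collected with something like the sketch below, which times repeated scans of one text using System.Diagnostics.Stopwatch. The class and method names (RegexBaseline, AverageScanMs) are hypothetical, and the numbers only mean something when recorded together with the size of the input text.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexBaseline
{
    // Returns the average number of milliseconds one full scan of `text` takes.
    public static double AverageScanMs(Regex pattern, string text, int runs)
    {
        // Warm-up run so that one-time regex compilation and JIT cost is not timed.
        int matchCount = pattern.Matches(text).Count;

        var timer = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            // MatchCollection is lazy; reading Count forces a scan of the whole text.
            matchCount = pattern.Matches(text).Count;
        }
        timer.Stop();

        return timer.Elapsed.TotalMilliseconds / runs;
    }
}
```

Running this once for the current pattern and once for each candidate change, on the same set of texts, gives figures that can be compared directly.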

