0
votes

I have the following dates in a text file:

04/20/2009;04/20/09;4/20/09;4/3/09;

Mar-20-2009;Mar 20, 2009;March 20, 2009;Mar. 20, 2009;Mar 20 2009;

20 Mar 2009;20 March 2009;20 Mar. 2009;20 March, 2009;

Mar 20th, 2009;Mar 21st, 2009;Mar 22nd, 2009;

Feb 2009; Sep 2009; Oct 2010;

6/2008;12/2009;

2009;2010

I am trying to match the content in line 5 (Feb 2009; Sep 2009; Oct 2010;) without capturing any of the other dates.

I have written the following regular expression, but it is also capturing parts of the other dates:

expr_5 = re.findall(r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d{4}',date)

out:

Expr list 5 : [(11, ['Mar 2009']), (12, ['March 2009']), (20, ['Feb 2009']), (21, ['Sep 2009']), (22, ['Oct 2010'])]

Note that the number in front of each match is just an index that identifies the position of the date in the list. How do I get rid of the dates at index 11 and 12? (They are part of the dates from line 3.)

Alternatively,

The expression below captures all of the dates on line 3. Is there a way to extend it so that it also captures all the dates on line 5 (everything from line 3 and line 5)?

expr_3 = re.findall(r'\d{2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[\s.,]*[\s]?\d{4}',date)

out:

Expr list 3 : [(11, ['20 Mar 2009']), (12, ['20 March 2009']), (13, ['20 Mar. 2009']), (14, ['20 March, 2009'])]

4 Answers

1
votes

Try this one.

import re


s = """
04/20/2009;04/20/09;4/20/09;4/3/09;

Mar-20-2009;Mar 20, 2009;March 20, 2009;Mar. 20, 2009;Mar 20 2009;

20 Mar 2009;20 March 2009;20 Mar. 2009;20 March, 2009;

Mar 20th, 2009;Mar 21st, 2009;Mar 22nd, 2009;

Feb 2009; Sep 2009; Oct 2010;

6/2008;12/2009;

2009;2010
"""


reg = re.compile(r"(^|; )\w{3} \d{4}", re.M)
match = ''.join([m.group() for m in reg.finditer(s)])

# gives you the matched string
print(match)

# If you just want to get the dates
dates = match.split('; ')
print(*dates, sep='\n')

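In the regex pattern, \w{3} matches exactly three word characters, which must be preceded by either the start of a line (^, with the re.M flag) or by "; ". Note that \w also matches digits and underscores, so this relies on the data having no other three-character token followed by a space and a four-digit number in that position.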
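Since \w{3} accepts any three word characters, a stricter variant is possible. The sketch below is one way to do it and is not part of the original answer: it spells out the month abbreviations, requires each date to start at a line start or right after a semicolon, and captures only the date itself. The names MONTHS and month_year are made up for illustration.

```python
import re

# Sample text copied from the question (no blank lines between rows).
s = """04/20/2009;04/20/09;4/20/09;4/3/09;
Mar-20-2009;Mar 20, 2009;March 20, 2009;Mar. 20, 2009;Mar 20 2009;
20 Mar 2009;20 March 2009;20 Mar. 2009;20 March, 2009;
Mar 20th, 2009;Mar 21st, 2009;Mar 22nd, 2009;
Feb 2009; Sep 2009; Oct 2010;
6/2008;12/2009;
2009;2010"""

# Hypothetical helper names, introduced only for this sketch.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# (?:^|;)  : the entry must begin at a line start or right after a semicolon
# (...)    : capture only the "Mon YYYY" part
# (?=...)  : the entry must end at a semicolon or at the end of the line
month_year = re.compile(
    rf"(?:^|;)\s*((?:{MONTHS})[a-z]*\s\d{{4}})(?=\s*(?:;|$))",
    re.M,
)

# With a single capture group, findall returns the captured strings.
dates = month_year.findall(s)
print(dates)
```

Because the separator sits outside the capture group, there is no need to join and re-split the matches afterwards.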

0
votes

You could simplify by splitting the string into dates using re.split and then you can test each one against a regex that has to match the whole of it. Example:

import re

test_strings = """04/20/2009;04/20/09;4/20/09;4/3/09;
Mar-20-2009;Mar 20, 2009;March 20, 2009;Mar. 20, 2009;Mar 20 2009;
20 Mar 2009;20 March 2009;20 Mar. 2009;20 March, 2009;
Mar 20th, 2009;Mar 21st, 2009;Mar 22nd, 2009;
Feb 2009; Sep 2009; Oct 2010;
6/2008;12/2009;
2009;2010""".split("\n")

# Raw string so that \s and \d reach the regex engine unchanged
pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s*\d{4}$'

for strng in test_strings:
    for date in re.split(r'\s*;\s*', strng):
        match = re.match(pattern, date)
        if match:
            print(match.group(0))

Gives:

Feb 2009
Sep 2009
Oct 2010

0
votes

Do you mean you want to get the last 3 items from your regex result? Try this:

expr_5[-3:]

You'll get output like this:

['Feb 2009', 'Sep 2009', 'Oct 2010']

0
votes

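Okay, I found the answer (thanks for all the other replies):

import re

with open('Assignment_1_data.txt') as fhandle:
    lines = fhandle.read()

expr_list_5 = []
for idx, date in enumerate(re.split(r';|\n', lines)):
    date = date.lstrip()
    # ^ anchors the month name to the start of the entry
    expr_5 = re.findall(r'^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d{4}', date)
    if expr_5:
        expr_list_5.append((idx, expr_5))
print('Expr list 5 :', expr_list_5)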

out:
Expr list 5 : [(20, ['Feb 2009']), (21, ['Sep 2009']), (22, ['Oct 2010'])]

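The only difference was adding the ^ anchor to the expression, so the month name must appear at the start of each entry. Entries such as "20 Mar 2009" start with a day number and therefore no longer match.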
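The alternative raised in the question (capturing the dates from line 3 and line 5 together) can probably be handled with the same per-entry approach. The following is a minimal sketch, assuming the data looks exactly like the sample in the question: the day is made optional and each entry has to match in full. The names MONTHS, day_month_year and matches are illustrative, and the sample text is inlined in place of the file.

```python
import re

# Sample text copied from the question, inlined instead of reading the file.
text = """04/20/2009;04/20/09;4/20/09;4/3/09;
Mar-20-2009;Mar 20, 2009;March 20, 2009;Mar. 20, 2009;Mar 20 2009;
20 Mar 2009;20 March 2009;20 Mar. 2009;20 March, 2009;
Mar 20th, 2009;Mar 21st, 2009;Mar 22nd, 2009;
Feb 2009; Sep 2009; Oct 2010;
6/2008;12/2009;
2009;2010"""

# Hypothetical helper names, introduced only for this sketch.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Optional day, month name, optional "." or ",", one whitespace, four-digit year.
day_month_year = re.compile(
    rf"(?:\d{{1,2}}\s)?(?:{MONTHS})[a-z]*[.,]?\s\d{{4}}"
)

matches = []
for idx, entry in enumerate(re.split(r";|\n", text)):
    entry = entry.strip()
    # fullmatch rejects entries with anything before or after the date,
    # e.g. "Mar 20, 2009", where "20," sits between month and year.
    if day_month_year.fullmatch(entry):
        matches.append((idx, entry))

for idx, entry in matches:
    print(idx, entry)
```

Using fullmatch instead of findall with ^ means both ends of the entry are checked, which is what keeps the month-first dates from line 2 and line 4 out of the result.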